TiDAL: Learning Training Dynamics for Active Learning
Seong Min Kye1,† Kwanghee Choi2,† Hyeongmin Byun1 Buru Chang3,∗
1Hyperconnect 2Carnegie Mellon University 3Sogang University
{harris,hyeongmin.byun}@hpcnt.com kwanghec@andrew.cmu.edu buru@sogang.ac.kr
arXiv:2210.06788v3 [cs.LG] 28 Sep 2023
Abstract
Active learning (AL) aims to select the most useful data
samples from an unlabeled data pool and annotate them to
expand the labeled dataset under a limited budget. Espe-
cially, uncertainty-based methods choose the most uncer-
tain samples, which are known to be effective in improving
model performance. However, previous methods often over-
look training dynamics (TD), defined as the ever-changing
model behavior during optimization via stochastic gradi-
ent descent, even though other research areas have empir-
ically shown that TD provides important clues for measur-
ing the data uncertainty. In this paper, we first provide the-
oretical and empirical evidence to argue the usefulness of
utilizing the ever-changing model behavior rather than the
fully trained model snapshot. We then propose a novel AL
method, Training Dynamics for Active Learning (TiDAL),
which efficiently predicts the training dynamics of unla-
beled data to estimate their uncertainty. Experimental re-
sults show that our TiDAL achieves better or comparable
performance on both balanced and imbalanced benchmark
datasets compared to state-of-the-art AL methods, which es-
timate data uncertainty using only static information after
model training.
1. Introduction
“There is a tide in the affairs of men. Which taken at the
flood, leads on to fortune.” — William Shakespeare
Active learning (AL) [5,31] aims to solve the real-
world problem of selecting the most useful data samples
from large-scale unlabeled data pools and annotating them
to expand labeled data under a limited budget. Since the
current deep neural networks are data-hungry, AL has in-
creasingly gained attention in recent years. Existing AL
methods can be divided into two mainstream categories:
diversity- and uncertainty-based methods. Diversity-based
methods [42,14] focus on constructing a subset that fol-
†Equal contribution.
∗Corresponding author.
This work was done while all authors were affiliated with Hyperconnect.
Figure 1: Our proposed TiDAL. The TD of training samples
x may differ even if they converge to the same final predicted
probability p(y|x) (upper row). Hence, we are motivated
to utilize the readily available rich information generated
during training, i.e., to leverage TD. We estimate the TD
of large-scale unlabeled data using a prediction module
instead of tracking the actual TD of all the unlabeled samples,
avoiding the computational overhead (lower row).
lows the target data distribution. Uncertainty-based meth-
ods [13,6,52] choose the most uncertain samples, which
are known to be effective in improving model performance.
Hence, the most critical question for the latter becomes,
“How can we quantify the data uncertainty?”
In this paper, we leverage training dynamics (TD) to
quantify data uncertainty. TD is defined as the ever-
changing model behavior on each data sample during op-
timization via stochastic gradient descent. Recent stud-
ies [9,29,48,47] have provided empirical evidence that
TD provides important clues for measuring the contribution
of each data sample to model performance improvement.
Inspired by these studies, we argue that the data uncertainty
of unlabeled data can be estimated with TD. However, most
uncertainty-based methods quantify data uncertainty based
on static information (e.g., loss [52] or predicted probabil-
ity [45]) from a fully-trained model “snapshot,” neglecting
the valuable information generated during training. We fur-
ther argue that TD is more effective in separating uncertain
and certain data than static information from a model snap-
shot captured after model training. In §3, we provide both
theoretical and empirical evidence to support our argument
that TD is a valuable tool for quantifying data uncertainty.
Despite its huge potential, TD is not yet actively ex-
plored in the domain of AL. This is because AL assumes
a massive unlabeled data pool. Previous studies track TD
only for the training data every epoch as it can be recorded
easily during model optimization. On the other hand, AL
targets a large number of unlabeled data, where tracking
the TD of each unlabeled sample requires an impractical
amount of computation (e.g., running inference on all the
unlabeled samples at every training epoch).
Therefore, we propose TiDAL (Training Dynamics for
Active Learning), a novel AL method that efficiently quan-
tifies the uncertainty of unlabeled data by estimating their
TD. We avoid tracking the TD of large-scale unlabeled data
every epoch by predicting the TD of unlabeled samples with
a TD prediction module. The module is trained with the TD
of labeled data, which is readily available during model op-
timization. During the data selection phase, we predict the
TD of unlabeled data with the trained module to quantify
their uncertainties. The module lets us obtain TD efficiently,
without running inference on all the unlabeled samples at
every epoch. Experimental results demonstrate that our TiDAL
achieves better or comparable performance to existing AL
methods on both balanced and imbalanced datasets. Ad-
ditional analyses show that our prediction module success-
fully predicts TD, and the predicted TD is useful in estimat-
ing uncertainties of unlabeled data. Our proposed method
is illustrated in Figure 1.
Contributions of our study: (1) We bridge the concept
of training dynamics and active learning with the theoretical
and experimental evidence that training dynamics is effec-
tive in estimating data uncertainty. (2) We propose a new
method that efficiently predicts the training dynamics of un-
labeled data to estimate their uncertainty. (3) Our proposed
method achieves better or comparable performance on both
balanced and imbalanced benchmark datasets compared to
existing active learning methods. For reproducibility, we
release the source code1.
2. Preliminaries
To better understand our proposed method, we first sum-
marize key concepts, including uncertainty-based active
1https://github.com/hyperconnect/TiDAL
learning, quantification of uncertainty, and training dynam-
ics.
Uncertainty-based active learning. In this work, we fo-
cus on uncertainty-based AL for multi-class classification
problems. We define the predicted probabilities of the given
sample xfor Cclasses as:
p= [p(1|x), p(2|x),··· , p(C|x)]T[0,1]C,(1)
where we denote the true label of x as y and the classifier
as f. D and D_u denote a labeled dataset and an unlabeled
data pool, respectively. The general cycle of uncertainty-
based AL has two steps: (1) train the target classifier f
on the labeled dataset D and (2) select the top-k uncertain
data samples from the unlabeled data pool D_u. Selected
samples are then given to human annotators to expand the
labeled dataset D, cycling back to the first step.
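The two-step cycle above can be sketched as follows. This is a minimal sketch, not the paper's released code: `f_train`, `f_predict`, and `uncertainty` are hypothetical stand-ins for the target classifier's training/inference routines and an uncertainty estimator such as entropy.

```python
import numpy as np

def active_learning_cycle(f_train, f_predict, uncertainty, D, D_u, k):
    """One uncertainty-based AL cycle: train on the labeled set D, then
    pick the top-k most uncertain samples from the unlabeled pool D_u
    to hand over to human annotators."""
    model = f_train(D)                                # step (1): fit classifier on D
    scores = np.array([uncertainty(f_predict(model, x)) for x in D_u])
    chosen = np.argsort(-scores)[:k]                  # step (2): top-k most uncertain
    return chosen                                     # indices to annotate
```

After annotation, the chosen samples move from D_u into D and the cycle repeats until the labeling budget is exhausted.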
Quantifying uncertainty. The objective of this study is to
establish a connection between the concept of TD and the
field of AL. To clearly demonstrate the effectiveness of
utilizing TD to quantify data uncertainty, we employ two of
the most prevalent and straightforward estimators, entropy
[43] and margin [41]. Entropy H is defined as follows:

H(p) = −Σ_{c=1}^{C} p(c|x) log p(c|x),   (2)
where the sample x is from the unlabeled data pool D_u.
Entropy focuses on the model's confidence in the given
sample x and grows as the prediction across the classes
becomes uniform (i.e., uncertain). Margin M measures the
difference between the probability of the true label and the
maximum of the others:

M(p) = p(y|x) − max_{c≠y} p(c|x),   (3)

where y denotes the true label. The smaller the margin, the
lower the model's confidence in the sample, so it can be
considered uncertain. Both entropy and margin are computed
with the predicted probabilities p of the fully trained
classifier f, taking only the snapshot of f into account.
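As a minimal NumPy sketch (not the authors' implementation), the two snapshot-based estimators of Equations 2 and 3 can be written as:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Eq. (2): H(p) = -sum_c p(c|x) log p(c|x).
    Larger values mean a more uniform (uncertain) prediction;
    eps guards against log(0)."""
    return -float(np.sum(p * np.log(p + eps)))

def margin(p, y):
    """Eq. (3): gap between the true-class probability p[y] and the
    largest competing class. Smaller values mean more uncertainty."""
    others = np.delete(p, y)
    return float(p[y] - np.max(others))
```

For a uniform two-class prediction [0.5, 0.5], entropy reaches its maximum log 2, while the margin of a confident prediction like [0.7, 0.2, 0.1] with true label 0 is 0.7 − 0.2 = 0.5.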
Defining training dynamics. TiDAL aims to leverage the
TD of unlabeled data to estimate their uncertainties. TD
can be defined as any model behavior during optimization,
such as the area under the margin between the logit values
of the target class and the largest other class [39], or the
variance of the predicted probabilities generated at each
epoch [47]. In this work, we define the TD p̄^(t) as the area
under the predicted probabilities of each data sample x obtained
Figure 2: Score distributions after long-tailed training,
shown for (a) entropy and (b) margin. We plot the marginal
distributions using kernel density estimation (KDE). It is
difficult to separate major (certain) and minor (uncertain)
class samples with the model snapshot-based scores
(horizontal), unlike the TD-driven scores (vertical), which
clearly separate the certain and uncertain samples.
during the t time steps of optimizing the target classifier f:

p^(i) = [p^(i)(1|x), p^(i)(2|x), ..., p^(i)(C|x)]^T,   (4)

p̄^(t) = [p̄^(t)(1|x), p̄^(t)(2|x), ..., p̄^(t)(C|x)]^T
      = Σ_{i=1}^{t} τ p^(i) = Σ_{i=1}^{t} p^(i) / t,   (5)

where p^(i) is the predicted probabilities of the target
classifier f at the i-th time step, and τ is the unit time step
used to normalize the predicted probabilities. For simplicity,
we record p^(i) every epoch and choose τ = 1/t, namely,
averaging the predicted probabilities over t epochs [47,46].
The TD p̄^(t) takes all the predicted probabilities during
model optimization into account. Hence, it encapsulates the
overall tendency of the model during t epochs of optimization,
avoiding being biased solely towards the snapshot p^(t) at
the final epoch t.
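Since τ = 1/t, the TD of Equation 5 is simply a running mean of the per-epoch predicted probabilities. A minimal NumPy sketch (the class name and interface are illustrative, not from the paper):

```python
import numpy as np

class TDTracker:
    """Maintains the TD of Eq. (5) for one sample as a running mean of
    the per-epoch predicted probabilities p^(i), with tau = 1/t."""
    def __init__(self, num_classes):
        self.sum = np.zeros(num_classes)
        self.t = 0

    def update(self, p_i):
        """Record the predicted probabilities p^(i) at the current epoch."""
        self.sum += p_i
        self.t += 1

    @property
    def td(self):
        """p_bar^(t) = (1/t) * sum_{i=1}^t p^(i)."""
        return self.sum / self.t
```

One tracker per labeled sample suffices, since the TD of labeled data is recorded for free during training; the point of TiDAL is precisely to avoid maintaining such trackers for the unlabeled pool.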
3. Is TD Useful for Quantifying Uncertainty?
In this section, we provide empirical and theoretical ev-
idence to support our argument: TD is more effective in
separating uncertain data from certain data than the model
snapshot, where the latter is often utilized to quantify data
uncertainty in previous works [52,45].
3.1. Motivating Observation
Settings. We aim to observe and compare the behavior of
TD and the model snapshot for different sample difficulties.
However, it is nontrivial to directly measure sample-wise
difficulty, inhibiting the quantitative analysis of data uncer-
tainty. To avoid this, we borrow the theoretical and empirical
results from long-tailed visual recognition [33,8,19]: it is
hard for deep neural network-based models to learn from
classes with fewer samples. Hence, we regard major- and
minor-class samples as containing many certain and uncertain
samples for the model, respectively. We train the target
classifier f on the long-tailed dataset for T epochs to obtain
the TD and the model snapshot. We apply both approaches
to the common estimators, entropy and margin. We denote
the entropy and margin scores from the model snapshot as
H and M. In contrast, we denote the TD-driven scores as
H̄ and M̄. More details and discussions are described in
Appendix B.
Results. Figure 2 shows the distributions of the scores
calculated with TD and with the model snapshot. We observe
that the TD-driven scores (H̄, M̄) successfully separate
the major- and minor-class samples, whereas the snapshot-based
scores (H, M) fail to do so. We conclude that, compared
to model snapshots, TD is more helpful in separating
uncertain samples from certain ones.
3.2. Theoretical Evidence
Theorem 1. (Informal) Under the LE-SDE framework
[54], with the assumption of local elasticity [17], certain
samples and uncertain samples exhibit different TD; in
particular, certain samples converge more quickly than
uncertain samples.
The above theorem discusses different model behaviors
depending on the difficulty of the sample. Compared to
an uncertain sample, a certain sample has same-class
samples nearby, which is the fundamental idea behind the
level set estimation [22] and nearest neighbor [36] literature.
We suspect that, due to the local elasticity of deep nets,
nearby samples have a bigger impact on the certain sample,
hence changing its predicted probability more rapidly. As
the certain sample is quicker to converge, its TD is larger
than that of the uncertain sample. Intuitively, the slower a
sample is to train, the more the classifier struggles to learn
it; hence, TD captures the uncertainty from the classifier's
perspective.
Theorem 2. (Informal) Estimators such as entropy (Equation
10) and margin (Equation 11) successfully capture the
difference in TD between easy and hard samples, even in
cases where the samples cannot be distinguished via the
predicted probabilities of the model snapshot.
The above theorem discusses the validity of entropy and
margin on whether they can successfully differentiate be-
tween two samples of different TD but with the same final
prediction. With Theorem 1, one can conclude that the com-
mon estimators’ scores calculated with TD are effective in
capturing the data uncertainty. Due to the space constraints,
we provide the details of the above results in Appendix A.
4. Utilizing TD for Active Learning
As tracking the TD of all the unlabeled data is compu-
tationally infeasible, we devise an efficient method to es-
timate the TD of unlabeled samples. We train a module
that directly predicts the TD of each sample by feeding it
the training samples, whose TD is freely available during
training. Then, based on the predicted TD of each unlabeled
sample, we use the common estimators, entropy or margin,
to determine which samples are the most uncertain so that
human annotators can label them. Hence, in this section,
we describe the details of the module that estimates
TD (§4.1) and how to train it (§4.2). Finally, we describe
how uncertainties for active learning are calculated using
the module's predictions (§4.3).
4.1. Training Dynamics Prediction Module
As mentioned, it is not computationally feasible to track
TD for large-scale unlabeled data, as doing so requires model
inference on all the unlabeled data at every training epoch.
Thus, we propose the TD prediction module m to efficiently
predict the TD of unlabeled data at the t-th epoch. Influenced
by previous studies [11,52,45,25] that use additional
modules to predict useful values such as loss or confidence
from the target model outputs, we aggregate multi-scale
feature maps and pass them into our TD prediction module.
The module produces the C-dimensional prediction:

p̃_m^(t) = [p̃_m^(t)(1|x), ..., p̃_m^(t)(C|x)]^T ∈ [0,1]^C,   (6)

estimating the actual TD p̄^(t) of the given sample x in
Equation 5. The TD prediction module is jointly trained with
the target classifier using a handful of parameters, incurring
a negligible computational cost during training. The detailed
architecture of the module is described in Appendix C.
Even though the architecture is similar to previous works
[52,45,25], we observed that our module is much more stable
during optimization and easier to train. We suspect that
this is due to the difference in target tasks: previous works
trained modules that output only a single value via regression,
whereas our module outputs a C-dimensional probability
distribution, which is similar to the main task of classifying
images.
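To make the idea concrete, here is a hypothetical NumPy sketch of such a module. The concatenation of already-pooled features and the single linear-plus-softmax head are illustrative assumptions; the actual architecture is given in Appendix C of the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TDPredictionModule:
    """Illustrative sketch of the TD prediction module m: concatenates
    pooled multi-scale features from the target classifier and maps
    them to a C-dimensional probability vector approximating the
    predicted TD p_tilde_m^(t) of Eq. (6)."""
    def __init__(self, feat_dims, num_classes, rng=None):
        rng = rng or np.random.default_rng(0)
        d = sum(feat_dims)                       # total aggregated feature size
        self.W = rng.normal(scale=0.01, size=(d, num_classes))
        self.b = np.zeros(num_classes)

    def __call__(self, feats):
        """feats: list of pooled feature vectors, one per scale."""
        h = np.concatenate(feats)                # multi-scale aggregation
        return softmax(h @ self.W + self.b)      # C-dim prediction in [0,1]^C
```

Because the output is a probability distribution rather than a single regressed scalar, training it alongside the classifier resembles the main classification task, which may explain the stability noted above.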
4.2. Training Objectives
To train the target classifier f at the t-th epoch, we
use the cross-entropy loss function L_target on the predicted
probability p^(t) and a one-hot encoded vector y ∈ {0,1}^C
of the true label y:

L_target = L_CE(p^(t), y) = −log p^(t)(y|x).   (7)

Meanwhile, the prediction module m learns the TD of a
sample x by minimizing the Kullback–Leibler (KL) divergence
between the predicted TD p̃_m^(t) and the actual TD p̄^(t):

L_module = L_KL(p̄^(t) || p̃_m^(t))
         = Σ_{c=1}^{C} p̄^(t)(c|x) log( p̄^(t)(c|x) / p̃_m^(t)(c|x) ).   (8)

The final objective function of our proposed method is
defined as follows:

L = L_target + λ L_module,   (9)

where λ is a balancing factor to control the effect of L_module
during model training.
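Equations 7–9 combine into a single scalar objective per sample. A minimal NumPy sketch (assuming dense probability vectors and a scalar class index; not the authors' training code):

```python
import numpy as np

def total_loss(p_t, y, p_bar, p_tilde, lam=1.0, eps=1e-12):
    """Per-sample objective of Eqs. (7)-(9).
    p_t:     classifier's predicted probabilities p^(t)  (C,)
    y:       true class index
    p_bar:   actual TD p_bar^(t) of the sample          (C,)
    p_tilde: module's predicted TD p_tilde_m^(t)        (C,)
    lam:     balancing factor lambda."""
    l_target = -np.log(p_t[y] + eps)                                    # Eq. (7)
    l_module = np.sum(p_bar * np.log((p_bar + eps) / (p_tilde + eps)))  # Eq. (8)
    return float(l_target + lam * l_module)                             # Eq. (9)
```

When the module's prediction matches the actual TD exactly, the KL term vanishes and only the cross-entropy of the target classifier remains.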
4.3. Quantifying Uncertainty with TD
We argue that uncertain samples can be effectively
distinguished from unlabeled data using the predicted TD. To
verify the effectiveness of leveraging TD, we feed the
predicted TD into entropy and margin (§2) by replacing the
snapshot probability p with the TD p̄. We choose these
estimators as they are widely used for quantifying uncertainty.
Feeding p̄ in place of p into the entropy gives H̄:

H̄(p̄) = −Σ_{c=1}^{C} p̄(c|x) log p̄(c|x).   (10)
Entropy H̄ is maximized when p̄ is uniform, i.e., when the
sample is uncertain for the target classifier. Margin M̄ is
employed similarly:

M̄(p̄) = p̄(ŷ|x) − max_{c≠ŷ} p̄(c|x).   (11)

Since we do not have the true labels of unlabeled samples,
we use the labels ŷ predicted by the target classifier instead
of the true labels. There are several possible variants of M̄
depending on the definition of ŷ. We conduct experiments
to compare M̄ with its variants. The experimental details
and results are in Appendix D.4.
At the data selection phase, we use the predicted TD
p̃_m^(T) instead of the actual TD p̄^(T) in Equations 10 & 11
to estimate the TD-driven uncertainties of each unlabeled
sample x at the final epoch T. Using the uncertainty
estimated with the predicted TD, we select the most informative
samples for model training.
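The selection phase can be sketched as follows, scoring each unlabeled sample's predicted TD with Equation 10 or 11 and returning the k most uncertain indices (function name and array layout are illustrative assumptions):

```python
import numpy as np

def select_with_predicted_td(td_pred, k, score="entropy", eps=1e-12):
    """td_pred: (N, C) array of predicted TD p_tilde_m^(T), one row per
    unlabeled sample. Returns the indices of the k most uncertain samples."""
    if score == "entropy":
        # Eq. (10): higher entropy = more uncertain
        s = -np.sum(td_pred * np.log(td_pred + eps), axis=1)
    else:
        # Eq. (11) with the predicted label y_hat = argmax; the margin is
        # top-1 minus top-2 probability, and a small margin = uncertain,
        # so we negate it to rank by uncertainty.
        part = np.sort(td_pred, axis=1)
        s = -(part[:, -1] - part[:, -2])
    return np.argsort(-s)[:k]
```

Both scores rank a near-uniform predicted TD as most uncertain, so for a pool of confident and uniform rows the same sample is chosen under either estimator.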
5. Experiments
In this section, we experimentally verify the effective-
ness of our method, TiDAL, which utilizes the estimated
training dynamics from the prediction module to discern
uncertain samples from unlabeled data. We describe the de-
tailed settings and the baseline methods for our experiments
(§5.1) and show the results on both balanced (§5.2) and im-
balanced datasets (§5.3). We further analyze whether the
TD prediction module is effective for AL performance and
can successfully estimate the TD (§5.4). We end the section
by discussing the potential limitations of our method (§5.5).
5.1. Experimental Setup
Datasets. To assess the performance of our proposed
method and baseline methods, we conduct experiments on