Making Your First Choice: To Address
Cold Start Problem in Vision Active Learning
Liangyu Chen¹  Yutong Bai²  Siyu Huang³  Yongyi Lu²
Bihan Wen¹  Alan L. Yuille²  Zongwei Zhou²
¹Nanyang Technological University  ²Johns Hopkins University  ³Harvard University
Abstract
Active learning promises to improve annotation efficiency by iteratively selecting
the most important data to be annotated first. However, we uncover a striking
contradiction to this promise: active learning fails to select data as efficiently
as random selection at the first few choices. We identify this as the cold start
problem in vision active learning, caused by a biased and outlier initial query. This
paper seeks to address the cold start problem by exploiting the three advantages of
contrastive learning: (1) no annotation is required; (2) label diversity is ensured
by pseudo-labels to mitigate bias; (3) typical data is determined by contrastive
features to reduce outliers. Experiments are conducted on CIFAR-10-LT and three
medical imaging datasets (i.e. Colon Pathology, Abdominal CT, and Blood Cell
Microscope). Our initial query not only significantly outperforms existing active
querying strategies but also surpasses random selection by a large margin. We
foresee our solution to the cold start problem as a simple yet strong baseline to
choose the initial query for vision active learning.
Code is available: https://github.com/c-liangyu/CSVAL
1 Introduction
The secret of getting ahead is getting started.
— Mark Twain
The cold start problem was initially found in recommender systems [56, 39, 9, 23] when algorithms had not gathered sufficient information about users with no purchase history. It also occurred in many other fields, such as natural language processing [55, 33] and computer vision [5, 11, 38], during the active learning procedure¹. Active learning promises to improve annotation efficiency by iteratively
selecting the most important data to annotate. However, we uncover a striking contradiction to this
promise: Active learning fails to select data as effectively as random selection at the first choice. We
identify this as the cold start problem in vision active learning and illustrate the problem using three
medical imaging applications (Figure 1a–c) as well as a natural imaging application (Figure 1d). Cold
start is a crucial topic [54, 30] because a performant initial query can lead to noticeably improved
subsequent cycle performance in the active learning procedure, evidenced in §3.3. There is a lack
of studies that systematically illustrate the cold start problem, investigate its causes, and provide
practical solutions to address it. To this end, we ask: What causes the cold start problem and how
can we select the initial query when there is no labeled data available?
Corresponding author: Zongwei Zhou (zzhou82@jh.edu)
¹ Active learning aims to select the most important data from the unlabeled dataset and query human experts to annotate new data. The newly annotated data is then added to improve the model. This process can be repeated until the model reaches a satisfactory performance level or the annotation budget is exhausted.
Preprint. Under review.
arXiv:2210.02442v1 [cs.CV] 5 Oct 2022
Figure 1: Cold start problem in vision active learning. Panels: (a) PathMNIST, (b) OrganAMNIST, (c) BloodMNIST, (d) CIFAR-10. Strategies compared: BALD (Kirsch et al., 2019), Consistency (Gao et al., 2020), Margin (Balcan et al., 2007), VAAL (Sinha et al., 2019), Coreset (Sener et al., 2017), Entropy (Wang et al., 2014), and Random. Most existing active querying strategies (e.g. BALD, Consistency, etc.) are outperformed by random selection in selecting initial queries, since random selection is i.i.d. to the entire dataset. However, some classes are not selected by active querying strategies due to selection bias, so their results are not presented in the low budget regime.
Random selection is generally considered a baseline to start active learning because the randomly
sampled query is independent and identically distributed (i.i.d.) to the entire data distribution. As is
known, maintaining a similar distribution between training and test data is beneficial, particularly
when using limited training data [25]. Therefore, a large body of existing work selects the initial query randomly [10, 61, 55, 62, 18, 17, 42, 24, 22, 60], highlighting that active querying compromises accuracy and diversity compared to random sampling at the beginning of active learning [36, 63, 44, 11, 20, 59]. Why? We attribute the causes of the cold start problem to the following two aspects:
(i) Biased query: Active learning tends to select data that is biased toward specific classes. Empirically,
Figure 2 reveals that the class distribution in the selected query is highly unbalanced. These active
querying strategies (e.g. Entropy, Margin, VAAL, etc.) can barely outperform random sampling at
the beginning because some classes are simply not selected for training. This is because data from the minority classes occur much less frequently than data from the majority classes. Moreover, datasets
in practice are often highly unbalanced, particularly in medical images [32, 58]. This can escalate
the biased sampling. We hypothesize that the label diversity of a query is an important criterion to
determine the importance of the annotation. To evaluate this hypothesis theoretically, we explore
the upper bound performance by enforcing a uniform distribution using ground truth (Table 1). To
evaluate this hypothesis practically, we pursue the label diversity by exploiting the pseudo-labels
generated by K-means clustering (Table 2). The label diversity can reduce the redundancy in the
selection of majority classes, and increase the diversity by including data of minority classes.
(ii) Outlier query: Many active querying strategies were proposed to select typical data and eliminate
outliers, but they heavily rely on a trained classifier to produce predictions or features. For example,
to calculate the value of Entropy, a trained classifier is required to predict logits of the data. However,
there is no such classifier at the start of active learning, at which point no labeled data is available
for training. To extract informative features for reliable predictions, we consider contrastive
learning, which can be trained using unlabeled data only. Contrastive learning encourages models to
discriminate between data augmented from the same image and data from different images [15, 13].
Such a learning process is called instance discrimination. We hypothesize that instance discrimination
can act as an alternative to select typical data and eliminate outliers. Specifically, the data that
is hard to discriminate from others could be considered as typical data. With the help of Dataset
Maps [48, 26]², we evaluate this hypothesis and propose a novel active querying strategy that can
effectively select typical data (hard-to-contrast data in our definition, see §2.2) and reduce outliers.
² It is worth noting that both [48] and [26] conducted a retrospective study, which analyzed existing active querying strategies by using the ground truth. As a result, the values of confidence and variability in the Dataset Maps could not be computed under the practical active learning setting because the ground truth is a priori unknown. Our modified strategy, however, does not require the availability of ground truth (detailed in §2.2).
Systematic ablation experiments and qualitative visualizations in §3 confirm that (i) the level of label
diversity and (ii) the inclusion of typical data are two explicit criteria for determining the annotation
importance. Naturally, contrastive learning is expected to approximate these two criteria: pseudo-
labels in clustering implicitly enforce label diversity in the query; instance discrimination determines
typical data. Extensive results show that our initial query not only significantly outperforms existing
active querying strategies, but also surpasses random selection by a large margin on three medical
imaging datasets (i.e. Colon Pathology, Abdominal CT, and Blood Cell Microscope) and two natural
imaging datasets (i.e. CIFAR-10 and CIFAR-10-LT). Our active querying strategy eliminates the
need for manual annotation to ensure the label diversity within initial queries, and more importantly,
starts the active learning procedure with the typical data.
To the best of our knowledge, we are among the first to indicate and address the cold start problem in
the field of medical image analysis (and perhaps, computer vision), making three contributions: (1)
illustrating the cold start problem in vision active learning, (2) investigating the underlying causes
with rigorous empirical analysis and visualization, and (3) determining effective initial queries for the
active learning procedure. Our solution to the cold start problem can be used as a strong yet simple
baseline to select the initial query for image classification and other vision tasks.
Related work.
When the cold start problem was first observed in recommender systems, there were
several solutions to remedy the insufficient information due to the lack of user history [63, 23]. In
natural language processing (NLP), Yuan et al. [55] were among the first to address the cold start
problem by pre-training models using self-supervision. They attributed the cold start problem to
model instability and data scarcity. Vision active learning has shown higher performance than random
selection [61, 47, 18, 2, 43, 34, 62], but there are few studies discussing how to select the initial query
when facing the entire unlabeled dataset. A few studies somewhat indicated the existence of the cold
start problem: Lang et al. [30] explored the effectiveness of the K-center algorithm [16] to select the initial queries. Similarly, Pourahmadi et al. [38] showed that a simple K-means clustering algorithm worked fairly well at the beginning of active learning, as it was capable of covering diverse classes and selecting a similar number of data per class. Most recently, a series of studies [20, 54, 46, 37]
continued to propose new strategies for selecting the initial query from the entire unlabeled data
and highlighted that typical data (defined in varying ways) could significantly improve the learning
efficiency of active learning at a low budget. In addition to the existing publications, our study
justifies the two causes of the cold start problem, systematically presents the existence of the problem
in six dominant strategies, and produces a comprehensive guideline of initial query selection.
2 Method
In this section, we analyze in depth the causes of the cold start problem from two perspectives: biased query as the inter-class factor and outlier query as the intra-class factor. We provide a complementary method to select the initial query based on both criteria. §2.1 illustrates that label diversity is a favourable selection criterion, and discusses how we obtain label diversity via simple contrastive learning and K-means algorithms. §2.2 describes an unsupervised method to sample typical (hard-to-contrast) queries from Dataset Maps.
2.1 Inter-class Criterion: Enforcing Label Diversity to Mitigate Bias
K-means clustering. The selected query should cover data of diverse classes and, ideally, select a similar number of data from each class. However, this requires the availability of ground truth, which is inaccessible by the nature of active learning. Therefore, we exploit pseudo-labels generated by a simple K-means clustering algorithm and select an equal number of data from each cluster to form the initial query, thereby facilitating label diversity. Without knowledge of the exact number of ground-truth classes, over-clustering is suggested in recent works [51, 57] to increase performance on datasets with higher intra-class variance. Concretely, given 9, 11, and 8 classes in the ground truth, we set K (the number of clusters) to 30 in our experiments.
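The equal-per-cluster selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `balanced_initial_query` is our own, and cluster assignments are assumed to come from K-means run beforehand on contrastive features.

```python
import numpy as np

def balanced_initial_query(cluster_ids, budget, rng=None):
    """Select an (approximately) equal number of samples per cluster.

    cluster_ids: K-means cluster assignment for each unlabeled image.
    budget: total number of images to query for annotation.
    Returns the indices of the selected images.
    """
    rng = np.random.default_rng(rng)
    clusters = np.unique(cluster_ids)
    per_cluster = budget // len(clusters)
    picked = []
    for c in clusters:
        members = np.flatnonzero(cluster_ids == c)
        take = min(per_cluster, len(members))
        picked.extend(rng.choice(members, size=take, replace=False))
    # Top up from the remaining pool if some clusters were too small.
    if len(picked) < budget:
        rest = np.setdiff1d(np.arange(len(cluster_ids)), picked)
        picked.extend(rng.choice(rest, size=budget - len(picked), replace=False))
    return np.asarray(picked)

ids = np.array([0] * 50 + [1] * 30 + [2] * 20)  # toy cluster assignment
query = balanced_initial_query(ids, budget=15, rng=0)
print(np.bincount(ids[query]))  # → [5 5 5]
```

Within each cluster, this sketch picks at random; §2.2 replaces that step by picking hard-to-contrast data.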
Contrastive features. K-means clustering requires features of each data point. Li et al. [31] suggested that for the purpose of clustering, contrastive methods (e.g. MoCo, SimCLR, BYOL) are
[Figure 2 shows, for each querying strategy (Random, Consistency, VAAL, Margin, Entropy, Coreset, BALD, and Ours), the distribution of queried samples over the nine PathMNIST classes: adipose, background, debris, epithelium, lymphocytes, mucus, mucosa, muscle, stroma. Entropy of each class distribution: Random 3.154, Consistency 3.116, VAAL 2.800, Margin 2.858, Entropy 2.852, Coreset 3.094, BALD 3.006, Ours 3.122.]
Figure 2:
Label diversity of querying criteria.
Random, the leftmost strategy, denotes the class
distribution of randomly queried samples, which can also reflect the approximate class distribution of
the entire dataset. As seen, even with a relatively larger initial query budget (40,498 images, 45%
of the dataset), most active querying strategies are biased towards certain classes in the PathMNIST
dataset. For example, VAAL prefers selecting data in the muscle class, but largely ignores data in the
mucus and mucosa classes. On the contrary, our querying strategy selects more data from minority
classes (e.g., mucus and mucosa) while retaining the class distribution of major classes. Similar
observations in OrganAMNIST and BloodMNIST are shown in Appendix Figure 7. The higher the
entropy is, the more balanced the class distribution is.
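The entropy reported in Figure 2 can be computed directly from the class labels of a query. A minimal sketch follows; the function name and the use of base-2 logarithms are our assumptions (base-2 is consistent with the reported values, since nine classes cap the entropy at log2(9) ≈ 3.17):

```python
import math
from collections import Counter

def class_distribution_entropy(labels):
    """Shannon entropy (in bits) of a query's class distribution.
    Higher entropy means a more balanced query; the maximum for C
    classes is log2(C), attained by a perfectly uniform selection."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced = ["adipose", "mucus", "muscle"] * 10   # uniform over 3 classes
biased = ["muscle"] * 28 + ["adipose", "mucus"]  # dominated by one class
print(class_distribution_entropy(balanced))  # log2(3) ≈ 1.585
print(class_distribution_entropy(biased))    # well below 1.585
```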
more suitable than generative methods (e.g. colorization, reconstruction) because the contrastive
feature matrix can be naturally regarded as cluster representations. Therefore, we use MoCo v2 [15], a popular self-supervised contrastive method, to extract image features.
K-means and MoCo v2 are certainly not the only choices for clustering and feature extraction. We employ these two well-received methods for simplicity and efficacy in addressing the cold start problem. Figure 2 shows that our querying strategy can yield better label diversity than the other six dominant active querying strategies; similar observations are made in OrganAMNIST and BloodMNIST (Figure 7) as well as CIFAR-10 and CIFAR-10-LT (Figure 10).
2.2 Intra-class Criterion: Querying Hard-to-Contrast Data to Avoid Outliers
Dataset map. Given K clusters generated from Criterion #1, we now determine which data points ought to be selected from each cluster. Intuitively, a data point can better represent a cluster distribution if it is harder to contrast itself with other data points in this cluster; we consider such data typical. To find these typical data, we modify the original Dataset Map³ by replacing the ground truth term with a pseudo-label term. This modification is made because ground truths are unknown in the active learning setting, but pseudo-labels are readily accessible from Criterion #1. For a visual comparison, Figure 3b and Figure 3c present the Data Maps based on ground truths and pseudo-labels,
respectively. Formally, the modified Data Map can be formulated as follows. Let $\mathcal{D} = \{x_m\}_{m=1}^{M}$ denote a dataset of $M$ unlabeled images. Considering a minibatch of $N$ images, for each image $x_n$, its two augmented views form a positive pair, denoted as $\tilde{x}_i$ and $\tilde{x}_j$. The contrastive prediction task on pairs of augmented images derived from the minibatch generates $2N$ images, in which a true label $y_n$ for an anchor augmentation is associated with its counterpart of the positive pair. We treat the other $2(N-1)$ augmented images within a minibatch as negative pairs. We define the probability of a positive pair in the instance discrimination task as:

$$p_{i,j} = \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{n=1}^{2N} \mathbb{1}[n \neq i]\, \exp(\mathrm{sim}(z_i, z_n)/\tau)}, \qquad (1)$$

$$p_{\theta^{(e)}}(y_n \mid x_n) = \frac{1}{2}\left[\, p_{2n-1,\,2n} + p_{2n,\,2n-1} \right], \qquad (2)$$

where $\mathrm{sim}(u, v) = u^{\top} v / (\|u\|\,\|v\|)$ is the cosine similarity between $u$ and $v$; $z_{2n-1}$ and $z_{2n}$ denote the projection head output of a positive pair for the input $x_n$ in a batch; $\mathbb{1}[n \neq i] \in \{0, 1\}$ is an indicator
³ Dataset Map [12, 48] was proposed to analyze datasets by two measures: confidence and variability, defined as the mean and standard deviation of the model probability of ground truth along the learning trajectory.
[Figure 3 panels: (a) Overall distribution; (b) Data Map by ground truth, annotated with easy-to-learn and hard-to-learn regions; (c) Data Map by pseudo-labels, annotated with easy-to-contrast and hard-to-contrast regions. Classes shown: basophil, eosinophil, erythroblast, lymphocyte, monocyte, neutrophil, ig, platelet.]
Figure 3:
Active querying based on Dataset Maps.
(a) Dataset overview. (b) Easy- and hard-to-
learn data can be selected from the maps based on ground truths [
26
]. This querying strategy has
two limitations: it requires manual annotations and the data are stratified by classes in the 2D space,
leading to a poor label diversity in the selected queries. (c) Easy- and hard-to-contrast data can be
selected from the maps based on pseudo-labels. This querying strategy is label-free and the selected
hard-to-contrast data represent the most common patterns in the entire dataset, as presented in (a).
These data are more suitable for training, and thus alleviate the cold start problem.
function evaluating to $1$ iff $n \neq i$; and $\tau$ denotes a temperature parameter. $\theta^{(e)}$ denotes the parameters at the end of the $e$-th epoch. We define confidence $(\hat{\mu}_m)$ across $E$ epochs as:

$$\hat{\mu}_m = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}(y_m \mid x_m). \qquad (3)$$

The confidence $(\hat{\mu}_m)$ is the Y-axis of the Dataset Maps (see Figure 3b-c).
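Equations (1)-(3) can be sketched in code. The function below is our own illustration (the name and the (E, 2N, d) storage layout are assumptions, not the paper's codebase): it computes each image's confidence from projection outputs saved at every epoch.

```python
import numpy as np

def contrastive_confidence(z_epochs, tau=0.5):
    """Confidence (Eq. 3): mean over E epochs of the positive-pair
    probability (Eqs. 1-2) in the instance discrimination task.
    z_epochs: (E, 2N, d) projection outputs, where rows 2n and 2n+1
    are the two augmented views of image n. Returns (N,) confidences."""
    E, two_n, _ = z_epochs.shape
    conf = np.zeros(two_n // 2)
    for z in z_epochs:
        z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> cosine sim
        sim = (z @ z.T) / tau
        np.fill_diagonal(sim, -np.inf)                    # implements the 1[n != i] mask
        p = np.exp(sim)
        p /= p.sum(axis=1, keepdims=True)                 # Eq. 1 for every (i, j)
        pos = 0.5 * (p[0::2, 1::2].diagonal()             # p_{2n-1, 2n}
                     + p[1::2, 0::2].diagonal())          # p_{2n, 2n-1}, Eq. 2
        conf += pos
    return conf / E                                       # Eq. 3

rng = np.random.default_rng(0)
conf = contrastive_confidence(rng.normal(size=(3, 8, 16)))  # E=3, N=4, d=16
print(conf.shape)  # (4,)
```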
Hard-to-contrast data. We consider the data with a low confidence value (Equation 3) as “hard-to-contrast” because they are seldom predicted correctly in the instance discrimination task. Intuitively, if the model cannot distinguish a data point from others, this data point is expected to carry typical characteristics that are shared across the dataset [40]. Visually, hard-to-contrast data gather in the bottom region of the Dataset Maps and “easy-to-contrast” data gather in the top region. As expected, hard-to-contrast data are more typical, possessing the most common visual patterns of the entire dataset; whereas easy-to-contrast data appear like outliers [54, 26], which may not follow the majority data distribution (examples in Figure 3a and Figure 3c). Additionally, we also plot the original Dataset Map [12, 48] in Figure 3b, which groups data into hard-to-learn and easy-to-learn⁴. Although the results in §3.2 show equally compelling performance achieved by both easy-to-learn [48] and hard-to-contrast data (ours), the latter do not require any manual annotation, and are therefore more practical and suitable for vision active learning.
In summary, to meet both criteria, our proposed active querying strategy includes three steps: (i) extracting features by self-supervised contrastive learning, (ii) assigning clusters by the K-means algorithm for label diversity, and (iii) selecting hard-to-contrast data from dataset maps.
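Assuming features (step i) and per-sample confidences (Equation 3) have been precomputed, steps (ii) and (iii) can be sketched end-to-end. The function name and the tiny Lloyd's-algorithm K-means inside are our own illustration under those assumptions, not the paper's implementation:

```python
import numpy as np

def initial_query(features, confidence, n_clusters, budget, n_iter=50, seed=0):
    """(ii) cluster contrastive features with K-means, then
    (iii) pick the lowest-confidence (hard-to-contrast) data per cluster."""
    rng = np.random.default_rng(seed)
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Tiny Lloyd's-algorithm K-means; a library implementation works too.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):                  # skip empty clusters
                centers[k] = X[assign == k].mean(axis=0)
    picked = []
    for k in range(n_clusters):                      # equal budget per cluster
        members = np.flatnonzero(assign == k)
        order = members[np.argsort(confidence[members])]  # hard-to-contrast first
        picked.extend(order[:budget // n_clusters])
    return np.asarray(picked, dtype=int)

rng = np.random.default_rng(1)
feats, conf = rng.normal(size=(200, 16)), rng.uniform(size=200)
query = initial_query(feats, conf, n_clusters=5, budget=20)
print(len(query))  # at most 20, spread across clusters
```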
3 Experimental Results
Datasets & metrics.
Active querying strategies have a selection bias that is particularly harmful
in long-tail distributions. Therefore, unlike most existing works [38, 54], which tested on highly
balanced annotated datasets, we deliberately examine our method and other baselines on long-
tail datasets to simulate real-world scenarios. Three medical datasets of different modalities
⁴ Swayamdipta et al. [48] indicated that easy-to-learn data facilitated model training in the low budget regime because easier data reduced the confusion when the model approached the rough decision boundary. In essence, the advantage of easy-to-learn data in active learning aligned with the motivation of curriculum learning [6].