Making Your First Choice: To Address
Cold Start Problem in Vision Active Learning
Liangyu Chen¹  Yutong Bai²  Siyu Huang³  Yongyi Lu²
Bihan Wen¹  Alan L. Yuille²  Zongwei Zhou²
¹Nanyang Technological University  ²Johns Hopkins University  ³Harvard University
Abstract
Active learning promises to improve annotation efficiency by iteratively selecting
the most important data to be annotated first. However, we uncover a striking
contradiction to this promise: active learning fails to select data as efficiently
as random selection at the first few choices. We identify this as the cold start
problem in vision active learning, caused by a biased and outlier initial query. This
paper seeks to address the cold start problem by exploiting the three advantages of
contrastive learning: (1) no annotation is required; (2) label diversity is ensured
by pseudo-labels to mitigate bias; (3) typical data is determined by contrastive
features to reduce outliers. Experiments are conducted on CIFAR-10-LT and three
medical imaging datasets (i.e. Colon Pathology, Abdominal CT, and Blood Cell
Microscope). Our initial query not only significantly outperforms existing active
querying strategies but also surpasses random selection by a large margin. We
foresee our solution to the cold start problem as a simple yet strong baseline to
choose the initial query for vision active learning.
Code is available: https://github.com/c-liangyu/CSVAL
1 Introduction
The secret of getting ahead is getting started.
— Mark Twain
The cold start problem was initially found in recommender systems [56, 39, 9, 23] when algorithms had not gathered sufficient information about users with no purchase history. It also occurred in many other fields, such as natural language processing [55, 33] and computer vision [5, 11, 38], during the active learning procedure¹. Active learning promises to improve annotation efficiency by iteratively
selecting the most important data to annotate. However, we uncover a striking contradiction to this
promise: Active learning fails to select data as effectively as random selection at the first choice. We
identify this as the cold start problem in vision active learning and illustrate the problem using three
medical imaging applications (Figure 1a–c) as well as a natural imaging application (Figure 1d). Cold
start is a crucial topic [54, 30] because a performant initial query can lead to noticeably improved
subsequent cycle performance in the active learning procedure, evidenced in §3.3. There is a lack
of studies that systematically illustrate the cold start problem, investigate its causes, and provide
practical solutions to address it. To this end, we ask: What causes the cold start problem and how
can we select the initial query when there is no labeled data available?
Corresponding author: Zongwei Zhou (zzhou82@jh.edu)
¹ Active learning aims to select the most important data from the unlabeled dataset and query human experts to annotate new data. The newly annotated data is then added to improve the model. This process can be repeated until the model reaches a satisfactory performance level or the annotation budget is exhausted.
Preprint. Under review.
arXiv:2210.02442v1 [cs.CV] 5 Oct 2022
Figure 1: Cold start problem in vision active learning. Panels: (a) PathMNIST, (b) OrganAMNIST, (c) BloodMNIST, (d) CIFAR-10. Strategies compared: BALD (Kirsch et al., 2019), Consistency (Gao et al., 2020), Margin (Balcan et al., 2007), VAAL (Sinha et al., 2019), Coreset (Sener et al., 2017), Entropy (Wang et al., 2014), and Random. Most existing active querying strategies (e.g. BALD, Consistency, etc.) are outperformed by random selection in selecting initial queries, since random selection is i.i.d. to the entire dataset. However, some classes are not selected by active querying strategies due to selection bias, so their results are not presented in the low budget regime.
Random selection is generally considered a baseline to start active learning because the randomly
sampled query is independent and identically distributed (i.i.d.) to the entire data distribution. As is
known, maintaining a similar distribution between training and test data is beneficial, particularly
when using limited training data [25]. Therefore, a large body of existing work selects the initial query randomly [10, 61, 55, 62, 18, 17, 42, 24, 22, 60], highlighting that active querying compromises accuracy and diversity compared to random sampling at the beginning of active learning [36, 63, 44, 11, 20, 59]. Why? We attribute the causes of the cold start problem to the following two aspects:
(i) Biased query: Active learning tends to select data that is biased toward specific classes. Empirically,
Figure 2 reveals that the class distribution in the selected query is highly unbalanced. These active
querying strategies (e.g. Entropy, Margin, VAAL, etc.) can barely outperform random sampling at
the beginning because some classes are simply not selected for training. This is because data from the minority classes occur much less frequently than data from the majority classes. Moreover, datasets
in practice are often highly unbalanced, particularly in medical images [32, 58]. This can escalate
the biased sampling. We hypothesize that the label diversity of a query is an important criterion to
determine the importance of the annotation. To evaluate this hypothesis theoretically, we explore
the upper bound performance by enforcing a uniform distribution using ground truth (Table 1). To
evaluate this hypothesis practically, we pursue the label diversity by exploiting the pseudo-labels
generated by K-means clustering (Table 2). The label diversity can reduce the redundancy in the
selection of majority classes, and increase the diversity by including data of minority classes.
(ii) Outlier query: Many active querying strategies were proposed to select typical data and eliminate
outliers, but they heavily rely on a trained classifier to produce predictions or features. For example,
to calculate the value of Entropy, a trained classifier is required to predict logits of the data. However,
there is no such classifier at the start of active learning, at which point no labeled data is available
for training. To extract informative features for reliable predictions, we consider contrastive
learning, which can be trained using unlabeled data only. Contrastive learning encourages models to
discriminate between data augmented from the same image and data from different images [15, 13].
Such a learning process is called instance discrimination. We hypothesize that instance discrimination
can act as an alternative to select typical data and eliminate outliers. Specifically, the data that
is hard to discriminate from others could be considered as typical data. With the help of Dataset
Maps [48, 26]², we evaluate this hypothesis and propose a novel active querying strategy that can
effectively select typical data (hard-to-contrast data in our definition, see §2.2) and reduce outliers.
² It is worth noting that both [48] and [26] conducted a retrospective study, which analyzed existing active querying strategies by using the ground truth. As a result, the values of confidence and variability in the Dataset Maps could not be computed under the practical active learning setting because the ground truth is a priori unknown. Our modified strategy, however, does not require the availability of ground truth (detailed in §2.2).
Systematic ablation experiments and qualitative visualizations in §3 confirm that (i) the level of label
diversity and (ii) the inclusion of typical data are two explicit criteria for determining the annotation
importance. Naturally, contrastive learning is expected to approximate these two criteria: pseudo-
labels in clustering implicitly enforce label diversity in the query; instance discrimination determines
typical data. Extensive results show that our initial query not only significantly outperforms existing
active querying strategies, but also surpasses random selection by a large margin on three medical
imaging datasets (i.e. Colon Pathology, Abdominal CT, and Blood Cell Microscope) and two natural
imaging datasets (i.e. CIFAR-10 and CIFAR-10-LT). Our active querying strategy eliminates the
need for manual annotation to ensure the label diversity within initial queries, and more importantly,
starts the active learning procedure with the typical data.
To the best of our knowledge, we are among the first to indicate and address the cold start problem in
the field of medical image analysis (and perhaps, computer vision), making three contributions: (1)
illustrating the cold start problem in vision active learning, (2) investigating the underlying causes
with rigorous empirical analysis and visualization, and (3) determining effective initial queries for the
active learning procedure. Our solution to the cold start problem can be used as a strong yet simple
baseline to select the initial query for image classification and other vision tasks.
Related work.
When the cold start problem was first observed in recommender systems, there were
several solutions to remedy the insufficient information due to the lack of user history [63, 23]. In
natural language processing (NLP), Yuan et al. [55] were among the first to address the cold start
problem by pre-training models using self-supervision. They attributed the cold start problem to
model instability and data scarcity. Vision active learning has shown higher performance than random
selection [61, 47, 18, 2, 43, 34, 62], but there are few studies discussing how to select the initial query
when facing the entire unlabeled dataset. A few studies somewhat indicated the existence of the cold
start problem: Lang et al. [30] explored the effectiveness of the K-center algorithm [16] to select the initial queries. Similarly, Pourahmadi et al. [38] showed that a simple K-means clustering algorithm worked fairly well at the beginning of active learning, as it was capable of covering diverse classes and selecting a similar number of data per class. Most recently, a series of studies [20, 54, 46, 37]
continued to propose new strategies for selecting the initial query from the entire unlabeled data
and highlighted that typical data (defined in varying ways) could significantly improve the learning
efficiency of active learning at a low budget. In addition to the existing publications, our study
justifies the two causes of the cold start problem, systematically presents the existence of the problem
in six dominant strategies, and produces a comprehensive guideline of initial query selection.
2 Method
In this section, we analyze in depth the causes of the cold start problem from two perspectives: biased query as the inter-class factor and outlier query as the intra-class factor. We provide a complementary method to select the initial query based on both criteria. §2.1 illustrates that label diversity is a favourable selection criterion, and discusses how we obtain label diversity via simple contrastive learning and K-means algorithms. §2.2 describes an unsupervised method to sample typical (hard-to-contrast) queries from Dataset Maps.
2.1 Inter-class Criterion: Enforcing Label Diversity to Mitigate Bias
K-means clustering. The selected query should cover data of diverse classes and, ideally, select a similar number of data from each class. However, this requires the availability of ground truth, which is inaccessible by the nature of active learning. Therefore, we exploit pseudo-labels generated by a simple K-means clustering algorithm and select an equal number of data from each cluster to form the initial query, thereby facilitating label diversity. Without knowledge of the exact number of ground-truth classes, over-clustering is suggested in recent works [51, 57] to increase performance on datasets with higher intra-class variance. Concretely, given 9, 11, and 8 classes in the ground truth, we set K (the number of clusters) to 30 in our experiments.
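The equal-per-cluster selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `balanced_initial_query` is our own, and cluster assignments are assumed to come from K-means run beforehand on contrastive features.

```python
import numpy as np

def balanced_initial_query(cluster_ids, budget, rng=None):
    """Select an (approximately) equal number of samples per cluster.

    cluster_ids: K-means cluster assignment for each unlabeled image.
    budget: total number of images to query for annotation.
    Returns the indices of the selected images.
    """
    rng = np.random.default_rng(rng)
    clusters = np.unique(cluster_ids)
    per_cluster = budget // len(clusters)
    picked = []
    for c in clusters:
        members = np.flatnonzero(cluster_ids == c)
        take = min(per_cluster, len(members))
        picked.extend(rng.choice(members, size=take, replace=False))
    # Top up from the remaining pool if some clusters were too small.
    if len(picked) < budget:
        rest = np.setdiff1d(np.arange(len(cluster_ids)), picked)
        picked.extend(rng.choice(rest, size=budget - len(picked), replace=False))
    return np.asarray(picked)

ids = np.array([0] * 50 + [1] * 30 + [2] * 20)  # toy cluster assignment
query = balanced_initial_query(ids, budget=15, rng=0)
print(np.bincount(ids[query]))  # → [5 5 5]
```

Within each cluster, this sketch picks at random; §2.2 replaces that step by picking hard-to-contrast data.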
Contrastive features. K-means clustering requires features of each data point. Li et al. [31] suggested that for the purpose of clustering, contrastive methods (e.g. MoCo, SimCLR, BYOL) are
[Figure 2 shows, for each querying strategy (Random, Consistency, VAAL, Margin, Entropy, Coreset, BALD, and Ours), the distribution of queried samples over the nine PathMNIST classes: adipose, background, debris, epithelium, lymphocytes, mucus, mucosa, muscle, stroma. Entropy of each class distribution: Random 3.154, Consistency 3.116, VAAL 2.800, Margin 2.858, Entropy 2.852, Coreset 3.094, BALD 3.006, Ours 3.122.]
Figure 2:
Label diversity of querying criteria.
Random, the leftmost strategy, denotes the class
distribution of randomly queried samples, which can also reflect the approximate class distribution of
the entire dataset. As seen, even with a relatively larger initial query budget (40,498 images, 45%
of the dataset), most active querying strategies are biased towards certain classes in the PathMNIST
dataset. For example, VAAL prefers selecting data in the muscle class, but largely ignores data in the
mucus and mucosa classes. On the contrary, our querying strategy selects more data from minority
classes (e.g., mucus and mucosa) while retaining the class distribution of major classes. Similar
observations in OrganAMNIST and BloodMNIST are shown in Appendix Figure 7. The higher the
entropy is, the more balanced the class distribution is.
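The entropy reported in Figure 2 can be computed directly from the class labels of a query. A minimal sketch follows; the function name and the use of base-2 logarithms are our assumptions (base-2 is consistent with the reported values, since nine classes cap the entropy at log2(9) ≈ 3.17):

```python
import math
from collections import Counter

def class_distribution_entropy(labels):
    """Shannon entropy (in bits) of a query's class distribution.
    Higher entropy means a more balanced query; the maximum for C
    classes is log2(C), attained by a perfectly uniform selection."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced = ["adipose", "mucus", "muscle"] * 10   # uniform over 3 classes
biased = ["muscle"] * 28 + ["adipose", "mucus"]  # dominated by one class
print(class_distribution_entropy(balanced))  # log2(3) ≈ 1.585
print(class_distribution_entropy(biased))    # well below 1.585
```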
more suitable than generative methods (e.g. colorization, reconstruction) because the contrastive
feature matrix can be naturally regarded as cluster representations. Therefore, we use MoCo v2 [15], a popular self-supervised contrastive method, to extract image features.
K-means and MoCo v2 are certainly not the only choices for clustering and feature extraction. We employ these two well-received methods for simplicity and efficacy in addressing the cold start problem. Figure 2 shows that our querying strategy can yield better label diversity than the other six dominant active querying strategies; similar observations are made in OrganAMNIST and BloodMNIST (Figure 7) as well as CIFAR-10 and CIFAR-10-LT (Figure 10).
2.2 Intra-class Criterion: Querying Hard-to-Contrast Data to Avoid Outliers
Dataset map. Given K clusters generated from Criterion #1, we now determine which data points ought to be selected from each cluster. Intuitively, a data point can better represent a cluster distribution if it is harder to contrast itself with other data points in this cluster; we consider such data typical. To find these typical data, we modify the original Dataset Map³ by replacing the ground truth term with a pseudo-label term. This modification is made because ground truths are unknown in the active learning setting, but pseudo-labels are readily accessible from Criterion #1. For a visual comparison, Figure 3b and Figure 3c present the Data Maps based on ground truths and pseudo-labels,
respectively. Formally, the modified Data Map can be formulated as follows. Let $\mathcal{D} = \{x_m\}_{m=1}^{M}$ denote a dataset of $M$ unlabeled images. Considering a minibatch of $N$ images, for each image $x_n$, its two augmented views form a positive pair, denoted as $\tilde{x}_i$ and $\tilde{x}_j$. The contrastive prediction task on pairs of augmented images derived from the minibatch generates $2N$ images, in which a true label $y_n$ for an anchor augmentation is associated with its counterpart of the positive pair. We treat the other $2(N-1)$ augmented images within a minibatch as negative pairs. We define the probability of a positive pair in the instance discrimination task as:

$$p_{i,j} = \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{n=1}^{2N} \mathbb{1}[n \neq i]\, \exp(\mathrm{sim}(z_i, z_n)/\tau)}, \qquad (1)$$

$$p_{\theta^{(e)}}(y_n \mid x_n) = \frac{1}{2}\left[\, p_{2n-1,\,2n} + p_{2n,\,2n-1} \right], \qquad (2)$$

where $\mathrm{sim}(u, v) = u^{\top} v / (\|u\|\,\|v\|)$ is the cosine similarity between $u$ and $v$; $z_{2n-1}$ and $z_{2n}$ denote the projection head output of a positive pair for the input $x_n$ in a batch; $\mathbb{1}[n \neq i] \in \{0, 1\}$ is an indicator
³ Dataset Map [12, 48] was proposed to analyze datasets by two measures: confidence and variability, defined as the mean and standard deviation of the model probability of ground truth along the learning trajectory.
[Figure 3 panels: (a) Overall distribution; (b) Data Map by ground truth, annotated with easy-to-learn and hard-to-learn regions; (c) Data Map by pseudo-labels, annotated with easy-to-contrast and hard-to-contrast regions. Classes shown: basophil, eosinophil, erythroblast, lymphocyte, monocyte, neutrophil, ig, platelet.]
Figure 3:
Active querying based on Dataset Maps.
(a) Dataset overview. (b) Easy- and hard-to-
learn data can be selected from the maps based on ground truths [
26
]. This querying strategy has
two limitations: it requires manual annotations and the data are stratified by classes in the 2D space,
leading to a poor label diversity in the selected queries. (c) Easy- and hard-to-contrast data can be
selected from the maps based on pseudo-labels. This querying strategy is label-free and the selected
hard-to-contrast data represent the most common patterns in the entire dataset, as presented in (a).
These data are more suitable for training, and thus alleviate the cold start problem.
function evaluating to $1$ iff $n \neq i$; and $\tau$ denotes a temperature parameter. $\theta^{(e)}$ denotes the parameters at the end of the $e$-th epoch. We define confidence $(\hat{\mu}_m)$ across $E$ epochs as:

$$\hat{\mu}_m = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}(y_m \mid x_m). \qquad (3)$$

The confidence $(\hat{\mu}_m)$ is the Y-axis of the Dataset Maps (see Figure 3b-c).
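Equations (1)-(3) can be sketched in code. The function below is our own illustration (the name and the (E, 2N, d) storage layout are assumptions, not the paper's codebase): it computes each image's confidence from projection outputs saved at every epoch.

```python
import numpy as np

def contrastive_confidence(z_epochs, tau=0.5):
    """Confidence (Eq. 3): mean over E epochs of the positive-pair
    probability (Eqs. 1-2) in the instance discrimination task.
    z_epochs: (E, 2N, d) projection outputs, where rows 2n and 2n+1
    are the two augmented views of image n. Returns (N,) confidences."""
    E, two_n, _ = z_epochs.shape
    conf = np.zeros(two_n // 2)
    for z in z_epochs:
        z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> cosine sim
        sim = (z @ z.T) / tau
        np.fill_diagonal(sim, -np.inf)                    # implements the 1[n != i] mask
        p = np.exp(sim)
        p /= p.sum(axis=1, keepdims=True)                 # Eq. 1 for every (i, j)
        pos = 0.5 * (p[0::2, 1::2].diagonal()             # p_{2n-1, 2n}
                     + p[1::2, 0::2].diagonal())          # p_{2n, 2n-1}, Eq. 2
        conf += pos
    return conf / E                                       # Eq. 3

rng = np.random.default_rng(0)
conf = contrastive_confidence(rng.normal(size=(3, 8, 16)))  # E=3, N=4, d=16
print(conf.shape)  # (4,)
```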
Hard-to-contrast data. We consider the data with a low confidence value (Equation 3) as “hard-to-contrast” because they are seldom predicted correctly in the instance discrimination task. Intuitively, if the model cannot distinguish a data point from others, this data point is expected to carry typical characteristics that are shared across the dataset [40]. Visually, hard-to-contrast data gather in the bottom region of the Dataset Maps and “easy-to-contrast” data gather in the top region. As expected, hard-to-contrast data are more typical, possessing the most common visual patterns of the entire dataset; whereas easy-to-contrast data appear like outliers [54, 26], which may not follow the majority data distribution (examples in Figure 3a and Figure 3c). Additionally, we also plot the original Dataset Map [12, 48] in Figure 3b, which groups data into hard-to-learn and easy-to-learn⁴. Although the results in §3.2 show equally compelling performance achieved by both easy-to-learn [48] and hard-to-contrast data (ours), the latter do not require any manual annotation, and are therefore more practical and suitable for vision active learning.
In summary, to meet both criteria, our proposed active querying strategy includes three steps: (i) extracting features by self-supervised contrastive learning, (ii) assigning clusters by the K-means algorithm for label diversity, and (iii) selecting hard-to-contrast data from dataset maps.
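Assuming features (step i) and per-sample confidences (Equation 3) have been precomputed, steps (ii) and (iii) can be sketched end-to-end. The function name and the tiny Lloyd's-algorithm K-means inside are our own illustration under those assumptions, not the paper's implementation:

```python
import numpy as np

def initial_query(features, confidence, n_clusters, budget, n_iter=50, seed=0):
    """(ii) cluster contrastive features with K-means, then
    (iii) pick the lowest-confidence (hard-to-contrast) data per cluster."""
    rng = np.random.default_rng(seed)
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Tiny Lloyd's-algorithm K-means; a library implementation works too.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):                  # skip empty clusters
                centers[k] = X[assign == k].mean(axis=0)
    picked = []
    for k in range(n_clusters):                      # equal budget per cluster
        members = np.flatnonzero(assign == k)
        order = members[np.argsort(confidence[members])]  # hard-to-contrast first
        picked.extend(order[:budget // n_clusters])
    return np.asarray(picked, dtype=int)

rng = np.random.default_rng(1)
feats, conf = rng.normal(size=(200, 16)), rng.uniform(size=200)
query = initial_query(feats, conf, n_clusters=5, budget=20)
print(len(query))  # at most 20, spread across clusters
```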
3 Experimental Results
Datasets & metrics.
Active querying strategies have a selection bias that is particularly harmful
in long-tail distributions. Therefore, unlike most existing works [38, 54], which tested on highly
balanced annotated datasets, we deliberately examine our method and other baselines on long-
tail datasets to simulate real-world scenarios. Three medical datasets of different modalities
⁴ Swayamdipta et al. [48] indicated that easy-to-learn data facilitated model training in the low budget regime because easier data reduced the confusion when the model approached the rough decision boundary. In essence, the advantage of easy-to-learn data in active learning aligned with the motivation of curriculum learning [6].