
Systematic ablation experiments and qualitative visualizations in §3 confirm that (i) the level of label
diversity and (ii) the inclusion of typical data are two explicit criteria for determining the annotation
importance. Naturally, contrastive learning is expected to approximate these two criteria: pseudo-
labels from clustering implicitly enforce label diversity in the query; instance discrimination identifies
typical data.
typical data. Extensive results show that our initial query not only significantly outperforms existing
active querying strategies, but also surpasses random selection by a large margin on three medical
imaging datasets (i.e. Colon Pathology, Abdominal CT, and Blood Cell Microscope) and two natural
imaging datasets (i.e. CIFAR-10 and CIFAR-10-LT). Our active querying strategy eliminates the
need for manual annotation to ensure the label diversity within initial queries, and more importantly,
starts the active learning procedure with the typical data.
To the best of our knowledge, we are among the first to identify and address the cold start problem in
the field of medical image analysis (and perhaps, computer vision), making three contributions: (1)
illustrating the cold start problem in vision active learning, (2) investigating the underlying causes
with rigorous empirical analysis and visualization, and (3) determining effective initial queries for the
active learning procedure. Our solution to the cold start problem can be used as a strong yet simple
baseline to select the initial query for image classification and other vision tasks.
Related work.
When the cold start problem was first observed in recommender systems, there were
several solutions to remedy the insufficient information due to the lack of user history [63, 23]. In
natural language processing (NLP), Yuan et al. [55] were among the first to address the cold start
problem by pre-training models using self-supervision. They attributed the cold start problem to
model instability and data scarcity. Vision active learning has shown higher performance than random
selection [61, 47, 18, 2, 43, 34, 62], but little work discusses how to select the initial query
when facing the entire unlabeled dataset. A few studies somewhat indicated the existence of the cold
start problem: Lang et al. [30] explored the effectiveness of the K-center algorithm [16] to select the
initial queries. Similarly, Pourahmadi et al. [38] showed that a simple K-means clustering algorithm
worked fairly well at the beginning of active learning, as it was capable of covering diverse classes
and selecting a similar number of data per class. Most recently, a series of studies [20, 54, 46, 37]
continued to propose new strategies for selecting the initial query from the entire unlabeled data
and highlighted that typical data (defined in varying ways) could significantly improve the learning
efficiency of active learning at a low budget. In addition to the existing publications, our study
justifies the two causes of the cold start problem, systematically presents the existence of the problem
in six dominant strategies, and produces a comprehensive guideline of initial query selection.
2 Method
In this section, we analyze in depth the causes of the cold start problem from two perspectives:
biased query as the inter-class factor and outlier query as the intra-class factor. We provide a
complementary method to select the initial query based on both criteria. §2.1 illustrates that label
diversity is a favourable selection criterion, and discusses how we obtain label diversity via simple
contrastive learning and K-means algorithms. §2.2 describes an unsupervised method to sample
typical (hard-to-contrast) queries from Dataset Maps.
2.1 Inter-class Criterion: Enforcing Label Diversity to Mitigate Bias
K-means clustering.
The selected query should cover data of diverse classes and, ideally, contain a
similar number of data from each class. However, this requires the availability of ground truth, which
is inaccessible by the nature of active learning. Therefore, we exploit pseudo-labels generated by a
simple K-means clustering algorithm and select an equal number of data from each cluster to form
the initial query, thereby facilitating label diversity. Without knowledge of the exact number of
ground-truth classes, over-clustering is suggested in recent works [51, 57] to increase performance
on datasets with higher intra-class variance. Concretely, given 9, 11, and 8 classes in the ground
truth, we set K (the number of clusters) to 30 in our experiments.
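The cluster-balanced selection described above can be sketched in a few lines. The following is a minimal NumPy-only illustration; the random stand-in features, the `kmeans` helper, the budget, and the seed are placeholders for exposition, not the paper's actual pipeline, which clusters contrastive features:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means returning a pseudo-label per data point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Recompute centers (keep the old center if a cluster is empty).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def balanced_initial_query(features, k, budget, seed=0):
    """Pick ~budget/k samples per pseudo-label cluster to enforce diversity."""
    labels = kmeans(features, k, seed=seed)
    rng = np.random.default_rng(seed)
    per_cluster = budget // k
    query = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        take = min(per_cluster, len(members))
        query.extend(rng.choice(members, size=take, replace=False).tolist())
    return sorted(query)

# Toy usage: 300 stand-in feature vectors, K = 30 clusters, budget of 60.
features = np.random.default_rng(1).normal(size=(300, 16))
query = balanced_initial_query(features, k=30, budget=60)
```

Because every cluster contributes at most `budget // K` samples, no single (pseudo-)class can dominate the initial query, which is the bias the inter-class criterion guards against.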
Contrastive features.
K-means clustering requires features of each data point. Li et al. [31]
suggested that for the purpose of clustering, contrastive methods (e.g. MoCo, SimCLR, BYOL) are
(Footnote: Dataset Maps could not be computed under the practical active learning setting because the ground truth is a priori
unknown. Our modified strategy, however, does not require the availability of ground truth, as detailed in §2.2.)