
In this paper, we present CEREAL (Cluster Evaluation with REstricted Availability of Labels), a
comprehensive framework for few-sample clustering evaluation without any explicit assumptions on
the evaluation metric or clustering algorithm. We propose several improvements to the standard active
sampling pipeline. First, we derive acquisition functions based on normalized mutual information, a
popular evaluation metric for clustering. The choice of acquisition function depends on whether the
clustering algorithm returns a cluster assignment (hard clustering) or a distribution over clusters (soft
clustering). Then, we use a semi-supervised learning algorithm to train the surrogate model with both
labeled and unlabeled data. Finally, we pseudo-label the unlabeled data with the learned surrogate
model before estimating the evaluation metric.
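
To make the pipeline concrete, the sketch below shows one possible instantiation under stated assumptions: scikit-learn's LabelSpreading stands in for the semi-supervised surrogate, and a simple uncertainty-based acquisition stands in for our NMI-based acquisition functions. The helper fit_surrogate and all dataset parameters are illustrative, not part of our implementation.

```python
# Illustrative sketch of the few-sample evaluation loop (assumptions:
# LabelSpreading as the semi-supervised surrogate, uncertainty sampling in
# place of the NMI-based acquisition functions; `fit_surrogate` is hypothetical).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_blobs(n_samples=500, centers=5, random_state=0)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

def fit_surrogate(labeled_idx):
    """Train the surrogate on labeled *and* unlabeled points (-1 = unlabeled)."""
    y_partial = np.full(len(X), -1)
    y_partial[labeled_idx] = y_true[labeled_idx]   # oracle annotations so far
    return LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)

rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X), size=10, replace=False))  # seed annotations
budget, batch = 100, 10

while len(labeled) < budget:
    surrogate = fit_surrogate(labeled)
    probs = surrogate.predict_proba(X)
    # Acquisition: query the unlabeled points the surrogate is least sure about.
    unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
    uncertainty = 1.0 - probs[unlabeled].max(axis=1)
    labeled.extend(unlabeled[np.argsort(-uncertainty)[:batch]].tolist())

# Pseudo-label every point with the final surrogate, keep the true labels
# where known, and estimate the evaluation metric on the full dataset.
surrogate = fit_surrogate(labeled)
pseudo = surrogate.predict(X)
pseudo[labeled] = y_true[labeled]
print("Estimated NMI:", normalized_mutual_info_score(pseudo, clusters))
print("True NMI:     ", normalized_mutual_info_score(y_true, clusters))
```

The key design choice this sketch reflects is that the final metric is computed over the entire dataset using pseudo-labels, rather than over the small labeled subset alone.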
Our experiments across multiple real-world datasets, clustering algorithms, and evaluation metrics
show that CEREAL estimates clustering quality far more accurately and reliably than several
baselines. CEREAL reduces the area under the absolute error curve (AEC) by up to 57% compared
to uniform sampling, and by up to 65% compared to the best-performing active sampling method,
which typically produces biased underestimates of NMI. In an extensive ablation study, we observe
that the combination of semi-supervised learning and pseudo-labeling is crucial for optimal
performance, as each component on its own might hurt performance
(see Table 1). We also validate the robustness of our framework across multiple clustering algorithms,
namely K-Means, spectral clustering, and BIRCH, and across a wide range of evaluation metrics,
namely normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted
Rand index (ARI). Finally, we show that CEREAL can be extended from clusterwise annotations
to pairwise annotations by using the surrogate model to pseudo-label the dataset. Our results with
pairwise annotations show that pseudo-labeling can approximate the evaluation metric but requires
significantly more annotations than clusterwise annotations to achieve similar estimates.
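
For reference, the hedged snippet below runs the clustering algorithms named above and scores each against reference labels with NMI, AMI, and ARI, using scikit-learn's implementations; the dataset and hyperparameters are placeholders, and CEREAL treats both the clustering algorithm and the metric as black boxes.

```python
# Hedged illustration: the clustering algorithms and external metrics named
# above, via scikit-learn; dataset and parameters are placeholders.
from sklearn.cluster import Birch, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)

algorithms = {
    "K-Means":  KMeans(n_clusters=4, n_init=10, random_state=0),
    "Spectral": SpectralClustering(n_clusters=4, random_state=0),
    "BIRCH":    Birch(n_clusters=4),
}
metrics = {
    "NMI": normalized_mutual_info_score,
    "AMI": adjusted_mutual_info_score,
    "ARI": adjusted_rand_score,
}

for name, algo in algorithms.items():
    clusters = algo.fit_predict(X)
    scores = {m: round(f(y_true, clusters), 3) for m, f in metrics.items()}
    print(name, scores)
```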
We summarize our contributions as follows:
• We introduce CEREAL, a framework for few-sample clustering evaluation. To the best of our knowledge, we are the first to investigate the problem of evaluating clustering with a limited labeling budget. Our solution uses a novel combination of active sampling and semi-supervised learning, including new NMI-based acquisition functions.
• Our experiments in the active sampling pipeline show that CEREAL almost always achieves the lowest AEC across language and vision datasets. We also show that our framework reliably estimates the quality of the clustering across different clustering algorithms, evaluation metrics, and annotation types.
2 RELATED WORK
Cluster Evaluation
The trade-offs associated with different types of clustering evaluation are
well-studied in the literature (Rousseeuw, 1987; Rosenberg & Hirschberg, 2007; Vinh et al., 2010;
Gösgens et al., 2021). Clustering evaluation metrics, oftentimes referred to as validation indices,
are either internal or external. Internal evaluation metrics gauge the quality of a clustering without
supervision and instead rely on the geometric properties of the clusters. However, they might not be
reliable, as they either do not account for the downstream task or make clustering-specific assumptions
(von Luxburg et al., 2012; Gösgens et al., 2021; Mishra et al., 2022). On the other hand, external evaluation
metrics require supervision, oftentimes in the form of ground truth annotations. Commonly used
external evaluation metrics are adjusted Rand index (Hubert & Arabie, 1985), V-measure (Rosenberg
& Hirschberg, 2007), and mutual information (Cover & Thomas, 2006) along with its normalized
and adjusted variants. We aim to estimate external evaluation metrics for a clustering given only a
limited number of labels from the ground-truth or reference clustering. Recently, Mishra et al. (2022) proposed a
framework to select the expected best clustering achievable given a hyperparameter tuning method
and a computation budget. Our work complements theirs by choosing the best clustering under a given
labeling budget.
Few-sample Evaluation
Few-sample evaluation seeks to test model performance with limited
access to human annotations (Hutchinson et al., 2022). These approaches rely on different sampling
strategies such as stratified sampling (Kumar & Raj, 2018), importance sampling (Sawade et al.,
2010; Poms et al., 2021), and active sampling (Kossen et al., 2021). Our work is closely related
to few-sample model evaluation but focuses on clustering. Often, existing approaches tailor their