CEREAL: FEW-SAMPLE CLUSTERING EVALUATION
Nihal V. Nayak†, Ethan R. Elenberg§, Clemens Rosenbaum§
†Department of Computer Science, Brown University
{nnayak2}@cs.brown.edu
§ASAPP Inc
{eelenberg, crosenbaum}@asapp.com
ABSTRACT
Evaluating clustering quality with reliable evaluation metrics like normalized mu-
tual information (NMI) requires labeled data that can be expensive to annotate. We
focus on the underexplored problem of estimating clustering quality with limited
labels. We adapt existing approaches from the few-sample model evaluation litera-
ture to actively sub-sample, with a learned surrogate model, the most informative
data points for annotation to estimate the evaluation metric. However, we find that
their estimates can be biased and rely only on the labeled data. To address this, we
introduce CEREAL, a comprehensive framework for few-sample clustering evalua-
tion that extends active sampling approaches in three key ways. First, we propose
novel NMI-based acquisition functions that account for the distinctive properties
of clustering and uncertainties from a learned surrogate model. Next, we use ideas
from semi-supervised learning and train the surrogate model with both the labeled
and unlabeled data. Finally, we pseudo-label the unlabeled data with the surrogate
model. We run experiments to estimate NMI in an active sampling pipeline on
three datasets across vision and language. Our results show that CEREAL reduces
the area under the absolute error curve by up to 57% compared to the best sampling
baseline. We perform an extensive ablation study to show that our framework
is agnostic to the choice of clustering algorithm and evaluation metric. We also
extend CEREAL from clusterwise annotations to pairwise annotations. Overall,
CEREAL can efficiently evaluate clustering with limited human annotations.
1 INTRODUCTION
Unsupervised clustering algorithms (Jain et al., 1999) partition a given dataset into meaningful groups
such that similar data points belong to the same cluster. Typically, data science and machine learning
practitioners can obtain a wide range of output clusterings by varying algorithm parameters such as
number of clusters and minimum inter-cluster distance (Mishra et al., 2022). However, evaluating
these clusterings can be challenging. Unsupervised evaluation metrics, such as Silhouette Index,
often do not correlate well with downstream performance (von Luxburg et al., 2012). On the other
hand, supervised evaluation metrics such as normalized mutual information (NMI) and adjusted Rand
index (ARI) require a labeled reference clustering. This supervised evaluation step introduces a costly
bottleneck which limits the applicability of clustering for exploratory data analysis. In this work, we
study an underexplored area of research: estimating the clustering quality with limited annotations.
Existing work on this problem, adapted from few-sample model evaluation, can often perform worse
than uniform random sampling (see Section 5). These works use learned surrogate models such
as multilayer perceptrons to identify the most informative unlabeled data from the evaluation set.
Similar to active learning, they then iteratively rank the next samples to be labeled according to an
acquisition function and the surrogate model’s predictions. However, many active sampling methods
derive acquisition functions tailored to a specific classification or regression metric (Sawade et al.,
2010; Kossen et al., 2021), which make them inapplicable to clustering. Furthermore, these methods
only rely on labeled data to learn the surrogate model and ignore the vast amounts of unlabeled data.
Work done at ASAPP Inc
In this paper, we present CEREAL (Cluster Evaluation with REstricted Availability of Labels), a
comprehensive framework for few-sample clustering evaluation without any explicit assumptions on
the evaluation metric or clustering algorithm. We propose several improvements to the standard active
sampling pipeline. First, we derive acquisition functions based on normalized mutual information, a
popular evaluation metric for clustering. The choice of acquisition function depends on whether the
clustering algorithm returns a cluster assignment (hard clustering) or a distribution over clusters (soft
clustering). Then, we use a semi-supervised learning algorithm to train the surrogate model with both
labeled and unlabeled data. Finally, we pseudo-label the unlabeled data with the learned surrogate
model before estimating the evaluation metric.
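For intuition, the sketch below outlines one possible round-based loop of this kind in Python with scikit-learn. It is illustrative only: the MLP surrogate fit purely on labeled points, the predictive-entropy acquisition score, and the function name estimate_nmi_few_sample are placeholders, whereas CEREAL uses the NMI-based acquisition functions of Section 4 and trains the surrogate with FixMatch on both labeled and unlabeled data.

# Illustrative sketch of a round-based few-sample evaluation loop (placeholders,
# not CEREAL's actual acquisition functions or semi-supervised training).
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.neural_network import MLPClassifier

def estimate_nmi_few_sample(X, test_clusters, oracle_labels, budget=100, batch_size=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    labeled = list(rng.choice(n, size=batch_size, replace=False))  # initial seed set
    while len(labeled) < budget:
        # Surrogate model: a plain MLP fit only on labeled points here;
        # CEREAL instead trains it with FixMatch on labeled and unlabeled data.
        surrogate = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(
            X[labeled], oracle_labels[labeled])
        probs = surrogate.predict_proba(X)
        # Acquisition: rank unlabeled points by predictive entropy (a generic
        # stand-in for the NMI-based acquisition functions of Section 4).
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        unlabeled = np.setdiff1d(np.arange(n), labeled)
        ranked = unlabeled[np.argsort(-entropy[unlabeled])]
        labeled.extend(ranked[:batch_size].tolist())  # query the annotator
    # Pseudo-label all remaining points with the surrogate, then estimate NMI.
    surrogate = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(
        X[labeled], oracle_labels[labeled])
    reference = surrogate.predict(X)
    reference[labeled] = oracle_labels[labeled]  # keep true labels where available
    return normalized_mutual_info_score(test_clusters, reference)

The final step mirrors the pseudo-labeling described above: the surrogate predicts reference labels for every unannotated point before the evaluation metric is computed on the full dataset.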
Our experiments across multiple real-world datasets, clustering algorithms, and evaluation metrics
show that CEREAL estimates the clustering quality more accurately and reliably than several
baselines. Our results show that CEREAL reduces the area under the absolute error curve (AEC) by up to
57% compared to uniform sampling. In fact, CEREAL reduces the AEC by up to 65% compared to the
best performing active sampling method, which typically produces biased underestimates of NMI. In
an extensive ablation study we observe that the combination of semi-supervised learning and pseudo-
labeling is crucial for optimal performance as each component on its own might hurt performance
(see Table 1). We also validate the robustness of our framework across multiple clustering algorithms
– namely K-Means, spectral clustering, and BIRCH – to estimate a wide range of evaluation metrics –
namely normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted
Rand index (ARI). Finally, we show that CEREAL can be extended from clusterwise annotations
to pairwise annotations by using the surrogate model to pseudo-label the dataset. Our results with
pairwise annotations show that pseudo-labeling can approximate the evaluation metric but requires
significantly more annotations than clusterwise annotations to achieve similar estimates.
We summarize our contributions as follows:
• We introduce CEREAL, a framework for few-sample clustering evaluation. To the best of our knowledge, we are the first to investigate the problem of evaluating clustering with a limited labeling budget. Our solution uses a novel combination of active sampling and semi-supervised learning, including new NMI-based acquisition functions.
• Our experiments in the active sampling pipeline show that CEREAL almost always achieves the lowest AEC across language and vision datasets. We also show that our framework reliably estimates the quality of the clustering across different clustering algorithms, evaluation metrics, and annotation types.
2 RELATED WORK
Cluster Evaluation
The trade-offs associated with different types of clustering evaluation are
well-studied in the literature (Rousseeuw, 1987; Rosenberg & Hirschberg, 2007; Vinh et al., 2010;
Gösgens et al., 2021). Clustering evaluation metrics - oftentimes referred to as validation indices -
are either internal or external. Internal evaluation metrics gauge the quality of a clustering without
supervision and instead rely on the geometric properties of the clusters. However, they might not be
reliable as they do not account for the downstream task or make clustering-specific assumptions (von
Luxburg et al., 2012; Gösgens et al., 2021; Mishra et al., 2022). On the other hand, external evaluation
metrics require supervision, oftentimes in the form of ground truth annotations. Commonly used
external evaluation metrics are adjusted Rand index (Hubert & Arabie, 1985), V-measure (Rosenberg
& Hirschberg, 2007), and mutual information (Cover & Thomas, 2006) along with its normalized
and adjusted variants. We aim to estimate external evaluation metrics for a clustering with limited
labels for the ground truth or the reference clustering. Recently, Mishra et al. (2022) proposed a
framework to select the expected best clustering achievable given a hyperparameter tuning method
and a computation budget. Our work complements theirs by choosing the best clustering under a given
labeling budget.
Few-sample Evaluation
Few-sample evaluation seeks to test model performance with limited
access to human annotations (Hutchinson et al., 2022). These approaches rely on different sampling
strategies such as stratified sampling (Kumar & Raj, 2018), importance sampling (Sawade et al.,
2010; Poms et al., 2021), and active sampling (Kossen et al., 2021). Our work is closely related
to few-sample model evaluation but focuses on clustering. Often, existing approaches tailor their
estimation to the classifier evaluation metric of interest, which makes them ill-suited to evaluate
clustering (Sawade et al., 2010; Kossen et al., 2021). While we choose to focus on clustering, we
do not make further assumptions about the evaluation metrics. Finally, a few approaches have used
unlabeled examples to estimate performance of machine learning models (Welinder et al., 2013; Ji
et al., 2021). These use a Bayesian approach along with unlabeled data to get a better estimate of
precision and recall (Welinder et al., 2013), or group fairness (Ji et al., 2020) of learned models. In
our work, we estimate a clustering evaluation metric and use the unlabeled data to train the surrogate
model. Furthermore, we also propose novel acquisition functions that account for the distinctive
properties of clustering and sample points for labeling proportional to the underlying evaluation metric.
Active Learning
Active learning, a well-studied area of few-sample learning (Cohn et al., 1996;
Settles, 2009; Gal et al., 2017; Kirsch et al., 2019; Ash et al., 2020; 2021), aims to train a model
by incrementally labeling data points until the labeling budget is exhausted. Most of these methods
differ in their acquisition functions to sub-sample and label data to train the models. We derive new
clustering-specific acquisition functions that account for the distinctive properties of clustering (see
Section 4). For a more in-depth introduction on active learning, we refer the reader to Settles (2009).
Semi-supervised Learning
Semi-supervised learning trains models using limited labeled data
alongside large amounts of unlabeled data (Berthelot et al., 2019; Xie et al., 2020; Sohn et al., 2020).
These methods use several strategies such as data augmentation and consistency regularization to
effectively use the unlabeled data. In our work, we use FixMatch (Sohn et al., 2020), a general-purpose
semi-supervised learning method, together with pseudo-labeling, and show that they are key to the success of our
framework. We can also use a more domain-specific semi-supervised learning algorithm instead of
FixMatch in CEREAL.
3 PRELIMINARIES
Few-sample Cluster Evaluation
Given a dataset, external evaluation estimates the clustering quality by measuring how much a (learned) test clustering differs from another (annotated) reference clustering by comparing the cluster assignments. Assuming that a clustering is a function $f_c: \mathcal{X} \mapsto \mathcal{K}$ from a dataset $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ to a set of clusters $\mathcal{K} = \{k_1, k_2, \ldots, k_m\}$, we can write cluster evaluation as computing a scalar function $v$ between two clusterings: $v(f_c(\mathcal{X}), f_y(\mathcal{X}))$.
In the few-sample case, one of the two clusterings, which we assume to be the reference clustering, is
only partially known. For convenience, we denote:
$C := \{f_c(x) \mid \forall x : x \in \mathcal{X}\}$, the test clustering.
$Y := \{f_y(x) \mid \forall x : x \in \mathcal{X}\}$, the reference clustering.
$Y_L := \{f_y(x) \mid \forall x : x \in \mathcal{X}_L \subseteq \mathcal{X}\}$, the partially labeled subset of the reference clustering.
$\mathcal{X}_U := \mathcal{X} \setminus \mathcal{X}_L$, the subset of the dataset without reference labels.
The problem then becomes estimating $v(C, Y)$ from the partial reference clustering, i.e., computing an estimate $\hat{v}(C, Y_L)$ such that $|v(C, Y) - \hat{v}(C, Y_L)|$ is minimized. While there are no formal requirements on $\mathcal{X}$ other than being clusterable, we assume for the rest of this paper that $\mathcal{X}$ is a set of vectors.
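As a concrete example, the sketch below instantiates this setup with scikit-learn, taking $v$ to be NMI; the synthetic dataset, the K-Means test clustering, and the labeling budget of 100 are illustrative choices. The naive estimate simply restricts both clusterings to a uniformly sampled labeled subset $\mathcal{X}_L$.

# Minimal illustration of the estimation problem, with v taken to be NMI
# (computed with scikit-learn) on a synthetic dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

X, y_true = make_blobs(n_samples=2000, centers=10, random_state=0)
c_test = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

v_full = normalized_mutual_info_score(c_test, y_true)  # v(C, Y): requires all labels

rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=100, replace=False)   # X_L: a small labeled subset
v_hat = normalized_mutual_info_score(c_test[labeled], y_true[labeled])  # naive estimate of v(C, Y_L)

print(f"v(C, Y) = {v_full:.3f}, naive estimate = {v_hat:.3f}, "
      f"absolute error = {abs(v_full - v_hat):.3f}")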
Normalized Mutual Information
Normalized mutual information (NMI) measures the mutual
dependence of two random variables with a number in the interval $[0, 1]$. As outlined in Gösgens
et al. (2021), it satisfies several desirable properties: it is easily interpretable, has linear complexity,
is symmetric, and monotonically increases with increasing overlap to the reference clustering. In
cluster evaluation, we treat the test clustering $C$ and the reference clustering $Y$ as random variables
and calculate $\mathrm{NMI}(C, Y)$ as follows:
$$\mathrm{NMI}(C; Y) = \frac{\sum_{y \in Y} \sum_{c \in C} I_{CY}[c; y]}{(H_Y + H_C)/2}, \qquad I_{CY}[c; y] = p(c, y) \log \frac{p(c, y)}{p(y)\, p(c)},$$
$$H_Y = -\sum_{y \in Y} p(y) \log p(y), \qquad H_C = -\sum_{c \in C} p(c) \log p(c).$$
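For reference, the snippet below is a direct implementation of this definition (natural logarithm, arithmetic-mean normalization over hard cluster assignments). It should agree with scikit-learn's normalized_mutual_info_score when average_method='arithmetic'.

# Direct implementation of the NMI formula above, checked against scikit-learn.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi(c, y):
    c, y = np.asarray(c), np.asarray(y)
    cs, ys = np.unique(c), np.unique(y)
    # Joint distribution p(c, y) from the contingency table, plus marginals.
    p_cy = np.array([[np.mean((c == ci) & (y == yi)) for yi in ys] for ci in cs])
    p_c, p_y = p_cy.sum(axis=1), p_cy.sum(axis=0)
    mask = p_cy > 0
    # I(C; Y) = sum_{c, y} p(c, y) log( p(c, y) / (p(c) p(y)) )
    mi = np.sum(p_cy[mask] * np.log(p_cy[mask] / np.outer(p_c, p_y)[mask]))
    h_c = -np.sum(p_c * np.log(p_c))  # H_C
    h_y = -np.sum(p_y * np.log(p_y))  # H_Y
    return mi / ((h_c + h_y) / 2)

c = [0, 0, 1, 1, 2, 2, 2]   # test clustering
y = [0, 0, 1, 1, 1, 2, 2]   # reference clustering
print(nmi(c, y))
print(normalized_mutual_info_score(c, y, average_method="arithmetic"))  # should agree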