
In this paper, we present CEREAL (Cluster Evaluation with REstricted Availability of Labels), a
comprehensive framework for few-sample clustering evaluation without any explicit assumptions on
the evaluation metric or clustering algorithm. We propose several improvements to the standard active
sampling pipeline. First, we derive acquisition functions based on normalized mutual information, a
popular evaluation metric for clustering. The choice of acquisition function depends on whether the
clustering algorithm returns a cluster assignment (hard clustering) or a distribution over clusters (soft
clustering). Then, we use a semi-supervised learning algorithm to train the surrogate model with both
labeled and unlabeled data. Finally, we pseudo-label the unlabeled data with the learned surrogate
model before estimating the evaluation metric.
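
To make the pipeline concrete, the sketch below shows one possible instantiation under stated assumptions: scikit-learn's LabelSpreading stands in for the semi-supervised surrogate, and a simple uncertainty-based acquisition stands in for our NMI-based acquisition functions. The helper fit_surrogate and all dataset parameters are illustrative, not part of our implementation.

```python
# Illustrative sketch of the few-sample evaluation loop (assumptions:
# LabelSpreading as the semi-supervised surrogate, uncertainty sampling in
# place of the NMI-based acquisition functions; `fit_surrogate` is hypothetical).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_blobs(n_samples=500, centers=5, random_state=0)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

def fit_surrogate(labeled_idx):
    """Train the surrogate on labeled *and* unlabeled points (-1 = unlabeled)."""
    y_partial = np.full(len(X), -1)
    y_partial[labeled_idx] = y_true[labeled_idx]   # oracle annotations so far
    return LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)

rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X), size=10, replace=False))  # seed annotations
budget, batch = 100, 10

while len(labeled) < budget:
    surrogate = fit_surrogate(labeled)
    probs = surrogate.predict_proba(X)
    # Acquisition: query the unlabeled points the surrogate is least sure about.
    unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
    uncertainty = 1.0 - probs[unlabeled].max(axis=1)
    labeled.extend(unlabeled[np.argsort(-uncertainty)[:batch]].tolist())

# Pseudo-label every point with the final surrogate, keep the true labels
# where known, and estimate the evaluation metric on the full dataset.
surrogate = fit_surrogate(labeled)
pseudo = surrogate.predict(X)
pseudo[labeled] = y_true[labeled]
print("Estimated NMI:", normalized_mutual_info_score(pseudo, clusters))
print("True NMI:     ", normalized_mutual_info_score(y_true, clusters))
```

The key design choice this sketch reflects is that the final metric is computed over the entire dataset using pseudo-labels, rather than over the small labeled subset alone.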
Our experiments across multiple real-world datasets, clustering algorithms, and evaluation metrics
show that CEREAL estimates clustering quality far more accurately and reliably than several
baselines. CEREAL reduces the area under the absolute error curve (AEC) by up to 57% compared
to uniform sampling, and by up to 65% compared to the best-performing active sampling method,
which typically produces biased underestimates of NMI. In an extensive ablation study, we observe
that the combination of semi-supervised learning and pseudo-labeling is crucial for optimal
performance, as each component on its own might hurt performance
(see Table 1). We also validate the robustness of our framework across multiple clustering algorithms,
namely K-Means, spectral clustering, and BIRCH, and across a wide range of evaluation metrics,
namely normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted
Rand index (ARI). Finally, we show that CEREAL can be extended from clusterwise annotations
to pairwise annotations by using the surrogate model to pseudo-label the dataset. Our results with
pairwise annotations show that pseudo-labeling can approximate the evaluation metric but requires
significantly more annotations than clusterwise annotations to achieve similar estimates.
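
For reference, the hedged snippet below runs the clustering algorithms named above and scores each against reference labels with NMI, AMI, and ARI, using scikit-learn's implementations; the dataset and hyperparameters are placeholders, and CEREAL treats both the clustering algorithm and the metric as black boxes.

```python
# Hedged illustration: the clustering algorithms and external metrics named
# above, via scikit-learn; dataset and parameters are placeholders.
from sklearn.cluster import Birch, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)

algorithms = {
    "K-Means":  KMeans(n_clusters=4, n_init=10, random_state=0),
    "Spectral": SpectralClustering(n_clusters=4, random_state=0),
    "BIRCH":    Birch(n_clusters=4),
}
metrics = {
    "NMI": normalized_mutual_info_score,
    "AMI": adjusted_mutual_info_score,
    "ARI": adjusted_rand_score,
}

for name, algo in algorithms.items():
    clusters = algo.fit_predict(X)
    scores = {m: round(f(y_true, clusters), 3) for m, f in metrics.items()}
    print(name, scores)
```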
We summarize our contributions as follows:
• We introduce CEREAL, a framework for few-sample clustering evaluation. To the best of our knowledge, we are the first to investigate the problem of evaluating clustering with a limited labeling budget. Our solution uses a novel combination of active sampling and semi-supervised learning, including new NMI-based acquisition functions.
• Our experiments in the active sampling pipeline show that CEREAL almost always achieves the lowest AEC across language and vision datasets. We also show that our framework reliably estimates the quality of the clustering across different clustering algorithms, evaluation metrics, and annotation types.
2 RELATED WORK
Cluster Evaluation
The trade-offs associated with different types of clustering evaluation are
well-studied in the literature (Rousseeuw, 1987; Rosenberg & Hirschberg, 2007; Vinh et al., 2010;
Gösgens et al., 2021). Clustering evaluation metrics, oftentimes referred to as validation indices,
are either internal or external. Internal evaluation metrics gauge the quality of a clustering without
supervision and instead rely on the geometric properties of the clusters. However, they might not be
reliable, as they either do not account for the downstream task or make clustering-specific assumptions
(von Luxburg et al., 2012; Gösgens et al., 2021; Mishra et al., 2022). On the other hand, external evaluation
metrics require supervision, oftentimes in the form of ground truth annotations. Commonly used
external evaluation metrics are adjusted Rand index (Hubert & Arabie, 1985), V-measure (Rosenberg
& Hirschberg, 2007), and mutual information (Cover & Thomas, 2006) along with its normalized
and adjusted variants. We aim to estimate external evaluation metrics for a clustering given only a
limited number of labels from the ground-truth or reference clustering. Recently, Mishra et al. (2022) proposed a
framework to select the expected best clustering achievable given a hyperparameter tuning method
and a computation budget. Our work complements theirs by choosing the best clustering under a given
labeling budget.
Few-sample Evaluation
Few-sample evaluation seeks to test model performance with limited
access to human annotations (Hutchinson et al., 2022). These approaches rely on different sampling
strategies such as stratified sampling (Kumar & Raj, 2018), importance sampling (Sawade et al.,
2010; Poms et al., 2021), and active sampling (Kossen et al., 2021). Our work is closely related
to few-sample model evaluation but focuses on clustering. Often, existing approaches tailor their