Improved Learning-augmented Algorithms for k-means and
k-medians Clustering
Thy Nguyen*, Anamay Chaturvedi*, Huy Lê Nguyễn*
{nguyen.thy2,chaturvedi.a,hu.nguyen}@northeastern.edu
Khoury College of Computer Sciences,
Northeastern University
March 2, 2023
Abstract
We consider the problem of clustering in the learning-augmented setting, where we are given a data
set in d-dimensional Euclidean space, and a label for each data point given by an oracle indicating what
subsets of points should be clustered together. This setting captures situations where we have access to
some auxiliary information about the data set relevant for our clustering objective, for instance the labels
output by a neural network. Following prior work, we assume that there are at most an α ∈ (0, c) fraction,
for some c < 1, of false positives and false negatives in each predicted cluster, in the absence of which the
labels would attain the optimal clustering cost OPT. For a dataset of size m, we propose a deterministic
k-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous
randomized algorithm while preserving the O(dm log m) runtime. Furthermore, our algorithm works even
when the predictions are not very accurate, i.e. our bound holds for α up to 1/2, an improvement over
α being at most 1/7 in the previous work. For the k-medians problem we improve upon prior work by
achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy
parameter α to get a cost of (1 + O(α))OPT, while requiring essentially just O(md log³(m/α)) runtime.
1 Introduction
In this paper we study k-means and k-medians clustering in the learning-augmented setting. In both these
problems we are given an input data set P of m points in d-dimensional Euclidean space and an associated
distance function dist(·, ·). The goal is to compute a set C of k points in that same space that minimizes the
following cost function:
cost(P, C) = Σ_{p ∈ P} min_{i ∈ [k]} dist(p, c_i).
In words, the cost associated with a single data point is its distance to the closest point in C, and the cost
of the whole data set is the sum of the costs of its individual points.
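As an aside for concreteness (not part of the original text), the objective is straightforward to evaluate directly; the following minimal NumPy sketch, with array names of our own choosing, computes it for both the k-means (squared distance) and k-medians (distance) variants discussed in the next paragraph.

import numpy as np

def clustering_cost(points, centers, objective="kmeans"):
    # points: (m, d) array, centers: (k, d) array.
    # Per-point cost is the distance to the nearest center: squared Euclidean
    # distance for k-means, Euclidean distance for k-medians.
    diffs = points[:, None, :] - centers[None, :, :]   # shape (m, k, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)             # shape (m, k)
    per_point = sq_dists.min(axis=1)
    if objective == "kmedians":
        per_point = np.sqrt(per_point)
    return per_point.sum()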
In the k-means setting dist(x, y) := ‖x − y‖², i.e., the square of the Euclidean distance, and in the k-
medians setting we set dist(x, y) := ‖x − y‖, although here, instead of the norm of x − y, we can in principle
also use any other distance function. These problems are well-studied in the literature of algorithms and
machine learning, and are known to be hard to solve exactly [Dasgupta, 2008], or even approximate well
beyond a certain factor [Cohen-Addad and Karthik C. S., 2019]. Although approximation algorithms are
known to exist for this problem and are used widely in practice, the theoretical approximation factors of
practical algorithms can be quite large, e.g., the 50-approximation in Song and Rajasekaran [2010] and the
O(ln k)-approximation in Arthur and Vassilvitskii [2006]. Meanwhile, the algorithms with relatively tight
approximation factors do not necessarily scale well in practice [Ahmadian et al., 2019].

*Equal contribution. All three authors were supported in part by NSF CAREER grant CCF-1750716 and NSF grant CCF-1909314.
To overcome these computational barriers, Ergun et al. [2022] proposed a learning-augmented setting
where we have access to some auxiliary information about the input data set. This is motivated by the
fact that in practice we expect the dataset of interest to have exploitable structures relevant to the optimal
clustering. For instance, a classifier’s predictions of points in a dataset can help group similar instances
together. This notion was formalized in Ergun et al. [2022] by assuming that we have access to a predictor
in the form of a labelling P = P_1 ∪ · · · ∪ P_k (all the points in P_i have the same label i ∈ [k]), such that there
exist an unknown optimal clustering P = P*_1 ∪ · · · ∪ P*_k, an associated set of centers C* = (c*_1, . . . , c*_k) that
achieve the optimally low clustering cost OPT (Σ_{i ∈ [k]} cost(P*_i, {c*_i}) = OPT), and a known label error rate
α such that:

|P_i ∩ P*_i| ≥ (1 − α) max(|P_i|, |P*_i|).
In simpler terms, the auxiliary partitioning (P_1, . . . , P_k) is close to some optimal clustering: each predicted
cluster has at most an α-fraction of points from outside its corresponding optimal cluster, and at most an
α-fraction of the points in the corresponding optimal cluster are missing from the predicted cluster. In other
words, the predictor has false positive and false negative rates of at most α for each label.
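To make the error model concrete, here is a small helper of our own (not from the paper's code) that checks whether a predicted labelling satisfies |P_i ∩ P*_i| ≥ (1 − α)·max(|P_i|, |P*_i|) for every label i, given a reference optimal clustering.

def respects_error_rate(pred_labels, opt_labels, alpha, k):
    # pred_labels, opt_labels: one cluster index in range(k) per data point;
    # opt_labels plays the role of the (in practice unknown) optimal clustering.
    pred = [set() for _ in range(k)]
    opt = [set() for _ in range(k)]
    for idx, (p, o) in enumerate(zip(pred_labels, opt_labels)):
        pred[p].add(idx)
        opt[o].add(idx)
    return all(
        len(pred[i] & opt[i]) >= (1 - alpha) * max(len(pred[i]), len(opt[i]))
        for i in range(k)
    )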
Observe that even when the predicted clusters P_i are close to a set of true clusters P*_i, in the sense
that the label error rate α is very small, computing the means or medians of the P_i can lead to arbitrarily bad
solutions. It is known that for k-means the point allocated to an optimal cluster should simply be the
average of all points in that cluster (this can be seen by differentiating the convex 1-means objective
and solving for the minimizer). However, a single false positive located far from the cluster can move this
allocated point arbitrarily far from the true points in the cluster and drive the cost up arbitrarily. This
forces clustering algorithms to process the predicted clusters in a way that precludes this possibility.
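This fragility is easy to observe numerically; in the toy sketch below (numbers of our own choosing), a single mislabeled far-away point drags the naive cluster average, and with it the 1-means cost on the true points, arbitrarily high.

import numpy as np

true_points = np.random.randn(100, 2)           # a well-behaved optimal cluster
outlier = np.array([[1e6, 1e6]])                # one false positive, arbitrarily far away
predicted_cluster = np.vstack([true_points, outlier])

naive_center = predicted_cluster.mean(axis=0)   # dragged toward the outlier
robust_center = true_points.mean(axis=0)        # the center we would like to recover

# k-means cost of the true points under each candidate center.
cost_naive = np.sum((true_points - naive_center) ** 2)
cost_robust = np.sum((true_points - robust_center) ** 2)
print(cost_naive, cost_robust)                  # cost_naive grows with the outlier's distance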
Using tools from the robust statistics literature, the authors of Ergun et al. [2022] proposed a randomized
algorithm that achieves a (1 + 20α)-approximation given a label error rate α < 1/7 and a guarantee that
each predicted cluster has at least k/α points. For the k-medians problem, the authors of Ergun et al. [2022] also
proposed an algorithm that achieves a (1 + α′)-approximation if each predicted cluster contains at least n/k
points and the label error rate α is at most O(α′⁴/(k log(k/α′))), where the big-Oh notation hides some small
unspecified constant, and α′ < 1.
The restrictions requiring the label error rate α to be small in both of the algorithms of Ergun et al. [2022] lead
us to investigate the following question:

Is it possible to design a k-means and a k-medians algorithm that achieve a (1 + O(α))-approximate clustering
when the predictor is not very accurate?
1.1 Our Contributions
In this work, we not only give an affirmative answer to the question above for both the k-means and the
k-medians problems, but our algorithms also have improved bounds on the clustering cost, while preserving the
time complexity of the previous approaches and removing the requirement of a lower bound on the size of
each predicted cluster.
For learning-augmented k-means, we modify the main subroutine of the previous randomized algorithm
to get a deterministic method that works for all α < 1/2, which is the natural breaking point (as explained
below). In the regime where the k-means algorithm of Ergun et al. [2022] applies, we improve the
approximation factor to 1 + 7.7α. For the larger domain α ∈ [0, 1/2), we derive a more general expression,
reproduced in Table 1. Furthermore, our algorithm has a better bound on the clustering cost than
the previous approach, while preserving the O(md log m) runtime and not requiring a lower bound
on the size of each predicted cluster.
Our k-medians algorithm improves upon the algorithm in Ergun et al. [2022] by achieving a (1 + O(α))-
approximation for α < 1/2, thereby improving both the range of α and the dependence of the approximation
factor on the label error rate, from bi-quadratic to near-linear. For success probability 1 − δ,
our runtime is O((1/(1 − 2α)) · md log³(m/α) · log k · log(k/((1 − 2α)δ)) · log(k/δ)), so by setting δ = 1/poly(k)
we incur just a logarithmic dependence of the runtime on k, as opposed to a polynomial dependence.
Work, Problem                  | Approx. Factor                            | Label Error Range   | Time Complexity
Ergun et al. [2022], k-Means   | 1 + 20α                                   | (10 log m / m, 1/7) | O(md log m)
Algorithm 1, k-Means           | 1 + (5α − 2α²) / ((1 − 2α)(1 − α))        | [0, 1/2)            | O(md log m)
                               | 1 + 7.7α                                  | [0, 1/7)            |
Ergun et al. [2022], k-Medians | 1 + Õ((kα)^{1/4})                         | small constant      | O(md log³ m + poly(k, log m))
Algorithm 2, k-Medians         | 1 + α(7 + 10α − 10α²) / ((1 − α)(1 − 2α)) | [0, 1/2)            | Õ((1/(1 − 2α)) · md log³ m · log²(k/δ))

Table 1: Comparison of learning-augmented k-means and k-medians algorithms. We recall that m is the
data set size, d is the ambient dimension, α is the label error rate, and δ is the failure probability (where
applicable). The success probability of the k-medians algorithm of Ergun et al. [2022] is 1 − poly(1/k). The
Õ notation hides some log factors to simplify the expressions.
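As a quick sanity check on the two k-means rows of Table 1 (a short script of our own, not from the paper), the general expression is indeed dominated by the simplified 1 + 7.7α bound on [0, 1/7):

import numpy as np

alphas = np.linspace(0, 1 / 7, 1000, endpoint=False)
general = 1 + (5 * alphas - 2 * alphas ** 2) / ((1 - 2 * alphas) * (1 - alphas))
simplified = 1 + 7.7 * alphas
assert np.all(general <= simplified + 1e-12)   # the simplified bound is never smaller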
Upper bound on α. Note that if the label error rate α equals 1/2, then even for three clusters there is
no longer a clear relationship between the predicted clusters and the related optimal clusters. For instance,
given three optimal clusters P*_1, P*_2, P*_3 with equally many points, if for all i ∈ [3] the predicted cluster
P_i consists of half the points in P*_i and half the points in P*_{(i+1) mod 3}, then the label error rate α = 1/2 is
achieved, but there is no clear relationship between P*_i and P_i. In other words, it is not clear whether the
predicted labels give us any useful information about an optimal clustering. In this sense, α = 1/2 is in a
way a natural stopping point for this problem.
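The cyclic construction above can be checked directly; the toy sketch below (sizes and indices of our own choosing) builds the three predicted clusters from halves of the optimal ones and confirms that every predicted cluster overlaps each of two optimal clusters in exactly half of its points, so no correspondence is favored.

n = 100  # points per optimal cluster (even)
opt = [set(range(i * n, (i + 1) * n)) for i in range(3)]   # P*_1, P*_2, P*_3 as index sets

def half(s, first):
    s = sorted(s)
    return set(s[: n // 2]) if first else set(s[n // 2:])

# P_i = half of P*_i together with half of P*_{(i+1) mod 3}.
pred = [half(opt[i], True) | half(opt[(i + 1) % 3], False) for i in range(3)]

for i in range(3):
    # |P_i ∩ P*_i| = n/2 = (1 − 1/2)·max(|P_i|, |P*_i|), i.e. α = 1/2 exactly,
    # and the overlap with P*_{(i+1) mod 3} is the same.
    print(i, len(pred[i] & opt[i]), len(pred[i] & opt[(i + 1) % 3]))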
1.2 Related Work
This work belongs to a growing literature on learning-augmented algorithms. Machine learning has been
used to improve algorithms for a number of classical problems, including data structures [Kraska et al., 2018,
Mitzenmacher, 2018, Lin et al., 2022], online algorithms [Purohit et al., 2018], graph algorithms [Khalil et al.,
2017, Chen et al., 2022a,b], frequency estimation [Du et al., 2021], caching [Rohatgi, 2020, Wei,
2020], and support estimation [Eden et al., 2021]. We refer the reader to Mitzenmacher and Vassilvitskii
[2020] for an overview and applications of the framework.
Another relevant line of work is clustering with side information. The works Balcan and Blum [2008],
Awasthi et al. [2014], Vikram and Dasgupta [2016] studied an interactive clustering setting where an oracle
interactively provides advice about whether or not to merge two clusters. Basu et al. [2004] proposed an
active learning framework for clustering, where the algorithm has access to a predictor that determines if two
points should or should not belong to the same cluster. Ashtiani et al. [2016] introduced a semi-supervised
active clustering framework where the algorithm has access to a predictor that answers queries about whether two
particular points belong to the same cluster in an optimal clustering. The goal is to produce a (1 + α)-approximate clustering
while minimizing the query complexity to the oracle.
Approximation stability, proposed in Balcan et al. [2013], is another assumption used to circumvent
the NP-hardness of approximation for k-means clustering. More formally, the concept of (c, α)-stability
requires that every c-approximate clustering is α-close to the optimal solution in terms of the fraction of
incorrectly clustered points. This is different from our setting, where at most an αfraction of the points are
incorrectly clustered and can worsen the clustering cost arbitrarily.
Gamlath et al. [2022] study the problem of k-means clustering in the presence of noisy labels, where
the cluster label of each point is created by either an adversarial or a random perturbation of the optimal
solution. Their Balanced Adversarial Noise Model assumes that the size of the symmetric difference between
the predicted cluster P_i and the optimal cluster P*_i is bounded by α|P*_i|. Their algorithm uses a subroutine
with runtime exponential in k and d for a fixed α ≤ 1/4. In this work, we make different assumptions on
the predicted cluster P_i and the optimal cluster P*_i. Moreover, our focus is on practical, nearly linear-time
algorithms that can scale to very large datasets for k-means and k-medians clustering.
Algorithm 1 Deterministic Learning-augmented k-Means Clustering
Require: Data set P of m points, partition P = P_1 ∪ . . . ∪ P_k from a predictor, accuracy parameter α
for i ∈ [k] do
    for j ∈ [d] do
        Let ω_{i,j} be the collection of all subsets of (1 − α)m_i contiguous points in P_{i,j}.
        I_{i,j} ← argmin_{Z ∈ ω_{i,j}} cost(Z, Z̄) = argmin_{Z ∈ ω_{i,j}} [ Σ_{z ∈ Z} z² − (1/|Z|)(Σ_{z′ ∈ Z} z′)² ]
    end for
    Let ĉ_i = (Ī_{i,j})_{j ∈ [d]}
end for
Return {ĉ_1, . . . , ĉ_k}
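The following Python sketch implements Algorithm 1 as we read the pseudocode; the function names are ours, "contiguous" is interpreted as contiguous in the sorted order of the one-dimensional projection, the window length is taken as ⌈(1 − α)m_i⌉, and every predicted cluster is assumed non-empty. With prefix sums, each dimension of each cluster costs O(m_i log m_i) for the sort plus O(m_i) for the scan.

import numpy as np

def best_window_mean(values, alpha):
    # Mean of the contiguous block of ceil((1 - alpha) * len(values)) sorted values
    # with the smallest 1-d cost  sum_z z^2 - (sum_z z)^2 / |Z|.
    x = np.sort(np.asarray(values, dtype=float))
    m = len(x)
    w = max(1, int(np.ceil((1 - alpha) * m)))
    s1 = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums of x
    s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])   # prefix sums of x^2
    sums = s1[w:] - s1[:-w]                           # window sums
    costs = (s2[w:] - s2[:-w]) - sums ** 2 / w        # window 1-d k-means costs
    best = int(np.argmin(costs))
    return sums[best] / w

def learning_augmented_kmeans(P, labels, k, alpha):
    # P: (m, d) array of points; labels: predicted cluster index per point.
    P = np.asarray(P, dtype=float)
    labels = np.asarray(labels)
    d = P.shape[1]
    centers = np.zeros((k, d))
    for i in range(k):
        cluster = P[labels == i]
        for j in range(d):
            centers[i, j] = best_window_mean(cluster[:, j], alpha)
    return centers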
2 k-Means
We briefly recall some notation for ease of reference.
Definition 1. We make the following definitions:
1. The given data set is denoted as P, and m := |P|. The output of the predictor is a partition (P_1, . . . , P_k)
of P. Further, m_i := |P_i|.

2. There exists an optimal partition (P*_1, . . . , P*_k) and centers (c*_1, . . . , c*_k) such that Σ_{i ∈ [k]} cost(P*_i, c*_i) =
OPT, the optimally low clustering cost for the data set P. Furthermore, m*_i := |P*_i|. For each cluster
i ∈ [k], denote the set of true positives P*_i ∩ P_i = Q_i. Recall that |Q_i| ≥ (1 − α) max(|P_i|, |P*_i|), for
some α < 1/2.

3. We denote the average of a set X by X̄. For the sets X_i and P_i we denote their projections onto the
j-th dimension by X_{i,j} and P_{i,j}, respectively.
Before we describe our algorithm, we recall why the naive solution of simply taking the average of each
cluster provided by the predictor is insufficient. Consider P_i, the set of points labeled i by the predictor.
Recall that the optimal 1-means solution for this set is its mean, P̄_i. Since the predictor is not perfect,
there might exist a number of points in P_i that are not actually in P*_i. Thus, if the points in P_i \ P*_i are
significantly far away from P*_i, they will increase the clustering cost arbitrarily if we simply use P̄_i as the
center. The following well-known identity formalizes this observation.
Lemma 2 (Inaba et al. [1994]). Consider a set X ⊂ R^d of size n and c ∈ R^d. Then

cost(X, c) = min_{c′ ∈ R^d} cost(X, c′) + n · ‖c − X̄‖² = cost(X, X̄) + n · ‖c − X̄‖².
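This identity is easy to verify numerically; the short sketch below (random data of our own choosing) checks it for a single cluster.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))    # n = 50 points in R^3
c = rng.normal(size=3)          # an arbitrary candidate center

x_bar = X.mean(axis=0)
lhs = np.sum((X - c) ** 2)                                           # cost(X, c)
rhs = np.sum((X - x_bar) ** 2) + len(X) * np.sum((c - x_bar) ** 2)   # cost(X, X̄) + n·‖c − X̄‖²
assert np.isclose(lhs, rhs)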
Ideally, we would like to be able to recover the set Q_i = P_i ∩ P*_i and use the average of Q_i as the center. We
know that |P*_i \ Q_i| ≤ αm*_i. By Lemma 3, it is not hard to show that cost(P*_i, Q̄_i) ≤ (1 + α/(1 − α)) cost(P*_i, P̄*_i) =
(1 + α/(1 − α)) cost(P*_i, c*_i), which also implies a (1 + α/(1 − α))-approximation for the problem.
Lemma 3. For any partition J_1 ∪ J_2 of a set J ⊂ R of size n, if |J_1| ≥ (1 − λ)n, then

|J̄ − J̄_1|² ≤ (λ / ((1 − λ)n)) · cost(J, J̄).
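For completeness, here is one way to fill in the "not hard to show" step above (our reconstruction, not a proof reproduced verbatim from the paper): apply Lemma 3 in each coordinate j with J = P*_{i,j}, J_1 = Q_{i,j} and λ = α, sum over the d coordinates, and then apply Lemma 2 with X = P*_i and c = Q̄_i:

‖P̄*_i − Q̄_i‖² = Σ_{j ∈ [d]} |P̄*_{i,j} − Q̄_{i,j}|² ≤ (α / ((1 − α)m*_i)) · cost(P*_i, P̄*_i),

and therefore

cost(P*_i, Q̄_i) = cost(P*_i, P̄*_i) + m*_i · ‖Q̄_i − P̄*_i‖² ≤ (1 + α/(1 − α)) · cost(P*_i, P̄*_i).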
Since we do not have access to Q_i, the main technical challenge is to filter out the outlier points in P_i
and construct a center close to Q̄_i. Minimizing the distance of the center to Q̄_i implies reducing the distance
to c*_i as well as the clustering cost.

Our algorithm for k-means, Algorithm 1, iterates over all clusters given by the predictor and finds a set of
contiguous points of size (1 − α)m_i with the smallest clustering cost in each dimension. At a high level, our
analysis shows that the average of the chosen points, Ī_{i,j}, is not too far away from that of the true positives,
Q̄_{i,j}. This also implies that the additive clustering cost of Ī_{i,j} would not be too large. Since we can analyze
the clustering cost by bounding the cost in every cluster i and dimension j, for simplicity we will not refer