Improved Learning-augmented Algorithms for k-means and
k-medians Clustering
Thy Nguyen*, Anamay Chaturvedi*, Huy Lê Nguyễn*
{nguyen.thy2,chaturvedi.a,hu.nguyen}@northeastern.edu
Khoury College of Computer Sciences,
Northeastern University
March 2, 2023
Abstract
We consider the problem of clustering in the learning-augmented setting, where we are given a data
set in d-dimensional Euclidean space, and a label for each data point given by an oracle indicating what
subsets of points should be clustered together. This setting captures situations where we have access to
some auxiliary information about the data set relevant for our clustering objective, for instance the labels
output by a neural network. Following prior work, we assume that there are at most an α ∈ (0, c) fraction,
for some c < 1, of false positives and false negatives in each predicted cluster, in the absence of which the
labels would attain the optimal clustering cost OPT. For a dataset of size m, we propose a deterministic
k-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous
randomized algorithm while preserving the O(dm log m) runtime. Furthermore, our algorithm works even
when the predictions are not very accurate, i.e. our bound holds for α up to 1/2, an improvement over
α being at most 1/7 in the previous work. For the k-medians problem we improve upon prior work by
achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy
parameter α to get a cost of (1 + O(α))OPT, while requiring essentially just O(md log³(m/α)) runtime.
1 Introduction
In this paper we study k-means and k-medians clustering in the learning-augmented setting. In both these
problems we are given an input data set P of m points in d-dimensional Euclidean space and an associated
distance function dist(·, ·). The goal is to compute a set C of k points in that same space that minimizes the
following cost function:
cost(P, C) = Σ_{p ∈ P} min_{i ∈ [k]} dist(p, c_i).
In words, the cost associated with a single data point is its distance to the closest point in C, and the cost
of the whole data set is the sum of the costs of its individual points.
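As an aside for concreteness (not part of the original text), the objective is straightforward to evaluate directly; the following minimal NumPy sketch, with array names of our own choosing, computes it for both the k-means (squared distance) and k-medians (distance) variants discussed in the next paragraph.

import numpy as np

def clustering_cost(points, centers, objective="kmeans"):
    # points: (m, d) array, centers: (k, d) array.
    # Per-point cost is the distance to the nearest center: squared Euclidean
    # distance for k-means, Euclidean distance for k-medians.
    diffs = points[:, None, :] - centers[None, :, :]   # shape (m, k, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)             # shape (m, k)
    per_point = sq_dists.min(axis=1)
    if objective == "kmedians":
        per_point = np.sqrt(per_point)
    return per_point.sum()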
In the k-means setting dist(x, y) := ‖x − y‖², i.e., the square of the Euclidean distance, and in the k-
medians setting we set dist(x, y) := ‖x − y‖, although here, instead of the norm of x − y, we can in principle
also use any other distance function. These problems are well-studied in the literature of algorithms and
machine learning, and are known to be hard to solve exactly [Dasgupta, 2008], or even approximate well
beyond a certain factor [Cohen-Addad and Karthik C. S., 2019]. Although approximation algorithms are
known to exist for this problem and are used widely in practice, the theoretical approximation factors of
practical algorithms can be quite large, e.g., the 50-approximation in Song and Rajasekaran [2010] and the
O(ln k)-approximation in Arthur and Vassilvitskii [2006]. Meanwhile, the algorithms with relatively tight
approximation factors do not necessarily scale well in practice [Ahmadian et al., 2019].

*Equal contribution. All three authors were supported in part by NSF CAREER grant CCF-1750716 and NSF grant CCF-1909314.
To overcome these computational barriers, Ergun et al. [2022] proposed a learning-augmented setting
where we have access to some auxiliary information about the input data set. This is motivated by the
fact that in practice we expect the dataset of interest to have exploitable structures relevant to the optimal
clustering. For instance, a classifier’s predictions of points in a dataset can help group similar instances
together. This notion was formalized in Ergun et al. [2022] by assuming that we have access to a predictor
in the form of a labelling P = P_1 ∪ · · · ∪ P_k (all the points in P_i have the same label i ∈ [k]), such that there
exist an unknown optimal clustering P = P*_1 ∪ · · · ∪ P*_k, an associated set of centers C* = (c*_1, . . . , c*_k) that
achieve the optimally low clustering cost OPT (Σ_{i ∈ [k]} cost(P*_i, {c*_i}) = OPT), and a known label error rate
α such that:

|P_i ∩ P*_i| ≥ (1 − α) max(|P_i|, |P*_i|).
In simpler terms, the auxiliary partitioning (P_1, . . . , P_k) is close to some optimal clustering: each predicted
cluster has at most an α-fraction of points from outside its corresponding optimal cluster, and at most an
α-fraction of the points in the corresponding optimal cluster are missing from the predicted cluster. In other
words, the predictor has false positive and false negative rates of at most α for each label.
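To make the error model concrete, here is a small helper of our own (not from the paper's code) that checks whether a predicted labelling satisfies |P_i ∩ P*_i| ≥ (1 − α)·max(|P_i|, |P*_i|) for every label i, given a reference optimal clustering.

def respects_error_rate(pred_labels, opt_labels, alpha, k):
    # pred_labels, opt_labels: one cluster index in range(k) per data point;
    # opt_labels plays the role of the (in practice unknown) optimal clustering.
    pred = [set() for _ in range(k)]
    opt = [set() for _ in range(k)]
    for idx, (p, o) in enumerate(zip(pred_labels, opt_labels)):
        pred[p].add(idx)
        opt[o].add(idx)
    return all(
        len(pred[i] & opt[i]) >= (1 - alpha) * max(len(pred[i]), len(opt[i]))
        for i in range(k)
    )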
Observe that even when the predicted clusters P_i are close to a set of true clusters P*_i, in the sense
that the label error rate α is very small, computing the means or medians of the P_i can lead to arbitrarily bad
solutions. It is known that for k-means the point allocated to an optimal cluster should simply be the
average of all points in that cluster (this can be seen by differentiating the convex 1-means objective
and solving for the minimizer). However, a single false positive located far from the cluster can move this
allocated point arbitrarily far from the true points in the cluster and drive the cost up arbitrarily. This
forces clustering algorithms to process the predicted clusters in a way that precludes this possibility.
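This fragility is easy to observe numerically; in the toy sketch below (numbers of our own choosing), a single mislabeled far-away point drags the naive cluster average, and with it the 1-means cost on the true points, arbitrarily high.

import numpy as np

true_points = np.random.randn(100, 2)           # a well-behaved optimal cluster
outlier = np.array([[1e6, 1e6]])                # one false positive, arbitrarily far away
predicted_cluster = np.vstack([true_points, outlier])

naive_center = predicted_cluster.mean(axis=0)   # dragged toward the outlier
robust_center = true_points.mean(axis=0)        # the center we would like to recover

# k-means cost of the true points under each candidate center.
cost_naive = np.sum((true_points - naive_center) ** 2)
cost_robust = np.sum((true_points - robust_center) ** 2)
print(cost_naive, cost_robust)                  # cost_naive grows with the outlier's distance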
Using tools from the robust statistics literature, the authors of Ergun et al. [2022] proposed a randomized
algorithm that achieves a (1 + 20α)-approximation given a label error rate α < 1/7 and a guarantee that
each predicted cluster has at least k/α points. For the k-medians problem, the authors of Ergun et al. [2022] also
proposed an algorithm that achieves a (1 + α′)-approximation if each predicted cluster contains at least n/k
points and the label error rate α is at most O(α′⁴/(k log(k/α′))), where the big-Oh notation hides some small
unspecified constant, and α′ < 1.
The restrictions requiring the label error rate α to be small in both of the algorithms of Ergun et al. [2022] lead
us to investigate the following question:

Is it possible to design a k-means and a k-medians algorithm that achieve a (1 + O(α))-approximate clustering
when the predictor is not very accurate?
1.1 Our Contributions
In this work, we not only give an affirmative answer to the question above for both the k-means and the
k-medians problems, but our algorithms also have improved bounds on the clustering cost, while preserving the
time complexity of the previous approaches and removing the requirement of a lower bound on the size of
each predicted cluster.
For learning-augmented k-means, we modify the main subroutine of the previous randomized algorithm
to get a deterministic method that works for all α < 1/2, which is the natural breaking point (as explained
below). In the regime where the k-means algorithm of Ergun et al. [2022] applies, we improve the
approximation factor to 1 + 7.7α. For the larger domain α ∈ [0, 1/2), we derive a more general expression,
reproduced in Table 1. Furthermore, our algorithm has a better bound on the clustering cost than
the previous approach, while preserving the O(md log m) runtime and not requiring a lower bound
on the size of each predicted cluster.
Our k-medians algorithm improves upon the algorithm in Ergun et al. [2022] by achieving a (1 + O(α))-
approximation for α < 1/2, thereby improving both the range of α and the dependence of the approximation
factor on the label error rate, from bi-quadratic to near-linear. For success probability 1 − δ,
our runtime is O((1/(1 − 2α)) · md log³(m/α) · log k · log(k/((1 − 2α)δ)) · log(k/δ)), so by setting δ = 1/poly(k)
we incur just a logarithmic dependence of the runtime on k, as opposed to a polynomial dependence.
Work, Problem                  | Approx. Factor                            | Label Error Range   | Time Complexity
Ergun et al. [2022], k-Means   | 1 + 20α                                   | (10 log m / m, 1/7) | O(md log m)
Algorithm 1, k-Means           | 1 + (5α − 2α²) / ((1 − 2α)(1 − α))        | [0, 1/2)            | O(md log m)
                               | 1 + 7.7α                                  | [0, 1/7)            |
Ergun et al. [2022], k-Medians | 1 + Õ((kα)^{1/4})                         | small constant      | O(md log³ m + poly(k, log m))
Algorithm 2, k-Medians         | 1 + α(7 + 10α − 10α²) / ((1 − α)(1 − 2α)) | [0, 1/2)            | Õ((1/(1 − 2α)) · md log³ m · log²(k/δ))

Table 1: Comparison of learning-augmented k-means and k-medians algorithms. We recall that m is the
data set size, d is the ambient dimension, α is the label error rate, and δ is the failure probability (where
applicable). The success probability of the k-medians algorithm of Ergun et al. [2022] is 1 − poly(1/k). The
Õ notation hides some log factors to simplify the expressions.
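As a quick sanity check on the two k-means rows of Table 1 (a short script of our own, not from the paper), the general expression is indeed dominated by the simplified 1 + 7.7α bound on [0, 1/7):

import numpy as np

alphas = np.linspace(0, 1 / 7, 1000, endpoint=False)
general = 1 + (5 * alphas - 2 * alphas ** 2) / ((1 - 2 * alphas) * (1 - alphas))
simplified = 1 + 7.7 * alphas
assert np.all(general <= simplified + 1e-12)   # the simplified bound is never smaller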
Upper bound on α. Note that if the label error rate α equals 1/2, then even for three clusters there is
no longer a clear relationship between the predicted clusters and the related optimal clusters. For instance,
given three optimal clusters P*_1, P*_2, P*_3 with equally many points, if for all i ∈ [3] the predicted cluster
P_i consists of half the points in P*_i and half the points in P*_{(i+1) mod 3}, then the label error rate α = 1/2 is
achieved, but there is no clear relationship between P*_i and P_i. In other words, it is not clear whether the
predicted labels give us any useful information about an optimal clustering. In this sense, α = 1/2 is in a
way a natural stopping point for this problem.
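The cyclic construction above can be checked directly; the toy sketch below (sizes and indices of our own choosing) builds the three predicted clusters from halves of the optimal ones and confirms that every predicted cluster overlaps each of two optimal clusters in exactly half of its points, so no correspondence is favored.

n = 100  # points per optimal cluster (even)
opt = [set(range(i * n, (i + 1) * n)) for i in range(3)]   # P*_1, P*_2, P*_3 as index sets

def half(s, first):
    s = sorted(s)
    return set(s[: n // 2]) if first else set(s[n // 2:])

# P_i = half of P*_i together with half of P*_{(i+1) mod 3}.
pred = [half(opt[i], True) | half(opt[(i + 1) % 3], False) for i in range(3)]

for i in range(3):
    # |P_i ∩ P*_i| = n/2 = (1 − 1/2)·max(|P_i|, |P*_i|), i.e. α = 1/2 exactly,
    # and the overlap with P*_{(i+1) mod 3} is the same.
    print(i, len(pred[i] & opt[i]), len(pred[i] & opt[(i + 1) % 3]))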
1.2 Related Work
This work belongs to a growing literature on learning-augmented algorithms. Machine learning has been
used to improve algorithms for a number of classical problems, including data structures [Kraska et al., 2018,
Mitzenmacher, 2018, Lin et al., 2022], online algorithms [Purohit et al., 2018], graph algorithms [Khalil et al.,
2017, Chen et al., 2022a,b], frequency estimation [Du et al., 2021], caching [Rohatgi, 2020, Wei,
2020], and support estimation [Eden et al., 2021]. We refer the reader to Mitzenmacher and Vassilvitskii
[2020] for an overview and applications of the framework.
Another relevant line of work is clustering with side information. The works Balcan and Blum [2008],
Awasthi et al. [2014], Vikram and Dasgupta [2016] studied an interactive clustering setting where an oracle
interactively provides advice about whether or not to merge two clusters. Basu et al. [2004] proposed an
active learning framework for clustering, where the algorithm has access to a predictor that determines if two
points should or should not belong to the same cluster. Ashtiani et al. [2016] introduced a semi-supervised
active clustering framework where the algorithm has access to a predictor that answers queries about whether two
particular points belong to the same cluster in an optimal clustering. The goal is to produce a (1 + α)-approximate clustering
while minimizing the query complexity to the oracle.
Approximation stability, proposed in Balcan et al. [2013], is another assumption used to circumvent
the NP-hardness of approximation for k-means clustering. More formally, the concept of (c, α)-stability
requires that every c-approximate clustering is α-close to the optimal solution in terms of the fraction of
incorrectly clustered points. This is different from our setting, where at most an αfraction of the points are
incorrectly clustered and can worsen the clustering cost arbitrarily.
Gamlath et al. [2022] study the problem of k-means clustering in the presence of noisy labels, where
the cluster label of each point is created by either an adversarial or a random perturbation of the optimal
solution. Their Balanced Adversarial Noise Model assumes that the size of the symmetric difference between
the predicted cluster P_i and the optimal cluster P*_i is bounded by α|P*_i|. Their algorithm uses a subroutine
with runtime exponential in k and d for a fixed α ≤ 1/4. In this work, we make different assumptions on
the predicted cluster P_i and the optimal cluster P*_i. Moreover, our focus is on practical, nearly linear-time
algorithms that can scale to very large datasets for k-means and k-medians clustering.
Algorithm 1 Deterministic Learning-augmented k-Means Clustering
Require: Data set P of m points, partition P = P_1 ∪ . . . ∪ P_k from a predictor, accuracy parameter α
for i ∈ [k] do
    for j ∈ [d] do
        Let ω_{i,j} be the collection of all subsets of (1 − α)m_i contiguous points in P_{i,j}.
        I_{i,j} ← argmin_{Z ∈ ω_{i,j}} cost(Z, Z̄) = argmin_{Z ∈ ω_{i,j}} [ Σ_{z ∈ Z} z² − (1/|Z|)(Σ_{z′ ∈ Z} z′)² ]
    end for
    Let ĉ_i = (Ī_{i,j})_{j ∈ [d]}
end for
Return {ĉ_1, . . . , ĉ_k}
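The following Python sketch implements Algorithm 1 as we read the pseudocode; the function names are ours, "contiguous" is interpreted as contiguous in the sorted order of the one-dimensional projection, the window length is taken as ⌈(1 − α)m_i⌉, and every predicted cluster is assumed non-empty. With prefix sums, each dimension of each cluster costs O(m_i log m_i) for the sort plus O(m_i) for the scan.

import numpy as np

def best_window_mean(values, alpha):
    # Mean of the contiguous block of ceil((1 - alpha) * len(values)) sorted values
    # with the smallest 1-d cost  sum_z z^2 - (sum_z z)^2 / |Z|.
    x = np.sort(np.asarray(values, dtype=float))
    m = len(x)
    w = max(1, int(np.ceil((1 - alpha) * m)))
    s1 = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums of x
    s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])   # prefix sums of x^2
    sums = s1[w:] - s1[:-w]                           # window sums
    costs = (s2[w:] - s2[:-w]) - sums ** 2 / w        # window 1-d k-means costs
    best = int(np.argmin(costs))
    return sums[best] / w

def learning_augmented_kmeans(P, labels, k, alpha):
    # P: (m, d) array of points; labels: predicted cluster index per point.
    P = np.asarray(P, dtype=float)
    labels = np.asarray(labels)
    d = P.shape[1]
    centers = np.zeros((k, d))
    for i in range(k):
        cluster = P[labels == i]
        for j in range(d):
            centers[i, j] = best_window_mean(cluster[:, j], alpha)
    return centers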
2 k-Means
We briefly recall some notation for ease of reference.
Definition 1. We make the following definitions:
1. The given data set is denoted as P, and m := |P|. The output of the predictor is a partition (P_1, . . . , P_k)
of P. Further, m_i := |P_i|.

2. There exists an optimal partition (P*_1, . . . , P*_k) and centers (c*_1, . . . , c*_k) such that Σ_{i ∈ [k]} cost(P*_i, c*_i) =
OPT, the optimally low clustering cost for the data set P. Furthermore, m*_i := |P*_i|. For each cluster
i ∈ [k], denote the set of true positives P*_i ∩ P_i = Q_i. Recall that |Q_i| ≥ (1 − α) max(|P_i|, |P*_i|), for
some α < 1/2.

3. We denote the average of a set X by X̄. For the sets X_i and P_i we denote their projections onto the
j-th dimension by X_{i,j} and P_{i,j}, respectively.
Before we describe our algorithm, we recall why the naive solution of simply taking the average of each
cluster provided by the predictor is insufficient. Consider P_i, the set of points labeled i by the predictor.
Recall that the optimal 1-means solution for this set is its mean, P̄_i. Since the predictor is not perfect,
there might exist a number of points in P_i that are not actually in P*_i. Thus, if the points in P_i \ P*_i are
significantly far away from P*_i, they will increase the clustering cost arbitrarily if we simply use P̄_i as the
center. The following well-known identity formalizes this observation.
Lemma 2 (Inaba et al. [1994]). Consider a set X ⊂ R^d of size n and c ∈ R^d. Then

cost(X, c) = min_{c′ ∈ R^d} cost(X, c′) + n · ‖c − X̄‖² = cost(X, X̄) + n · ‖c − X̄‖².
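This identity is easy to verify numerically; the short sketch below (random data of our own choosing) checks it for a single cluster.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))    # n = 50 points in R^3
c = rng.normal(size=3)          # an arbitrary candidate center

x_bar = X.mean(axis=0)
lhs = np.sum((X - c) ** 2)                                           # cost(X, c)
rhs = np.sum((X - x_bar) ** 2) + len(X) * np.sum((c - x_bar) ** 2)   # cost(X, X̄) + n·‖c − X̄‖²
assert np.isclose(lhs, rhs)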
Ideally, we would like to be able to recover the set Q_i = P_i ∩ P*_i and use the average of Q_i as the center. We
know that |P*_i \ Q_i| ≤ αm*_i. By Lemma 3, it is not hard to show that cost(P*_i, Q̄_i) ≤ (1 + α/(1 − α)) cost(P*_i, P̄*_i) =
(1 + α/(1 − α)) cost(P*_i, c*_i), which also implies a (1 + α/(1 − α))-approximation for the problem.
Lemma 3. For any partition J_1 ∪ J_2 of a set J ⊂ R of size n, if |J_1| ≥ (1 − λ)n, then

|J̄ − J̄_1|² ≤ (λ / ((1 − λ)n)) · cost(J, J̄).
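For completeness, here is one way to fill in the "not hard to show" step above (our reconstruction, not a proof reproduced verbatim from the paper): apply Lemma 3 in each coordinate j with J = P*_{i,j}, J_1 = Q_{i,j} and λ = α, sum over the d coordinates, and then apply Lemma 2 with X = P*_i and c = Q̄_i:

‖P̄*_i − Q̄_i‖² = Σ_{j ∈ [d]} |P̄*_{i,j} − Q̄_{i,j}|² ≤ (α / ((1 − α)m*_i)) · cost(P*_i, P̄*_i),

and therefore

cost(P*_i, Q̄_i) = cost(P*_i, P̄*_i) + m*_i · ‖Q̄_i − P̄*_i‖² ≤ (1 + α/(1 − α)) · cost(P*_i, P̄*_i).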
Since we do not have access to Q_i, the main technical challenge is to filter out the outlier points in P_i
and construct a center close to Q̄_i. Minimizing the distance of the center to Q̄_i implies reducing the distance
to c*_i as well as the clustering cost.

Our algorithm for k-means, Algorithm 1, iterates over all clusters given by the predictor and finds a set of
contiguous points of size (1 − α)m_i with the smallest clustering cost in each dimension. At a high level, our
analysis shows that the average of the chosen points, Ī_{i,j}, is not too far away from that of the true positives,
Q̄_{i,j}. This also implies that the additive clustering cost of Ī_{i,j} would not be too large. Since we can analyze
the clustering cost by bounding the cost in every cluster i and dimension j, for simplicity we will not refer