
Work, Problem | Approx. Factor | Label Error Range | Time Complexity
Ergun et al. [2022], k-Means | 1 + 20α | (10 log m / √m, 1/7) | O(md log m)
Algorithm 1, k-Means | 1 + (5α − 2α²) / ((1 − 2α)(1 − α)) | [0, 1/2) | O(md log m)
Algorithm 1, k-Means | 1 + 7.7α | [0, 1/7) | O(md log m)
Ergun et al. [2022], k-Medians | 1 + Õ((kα)^(1/4)) | small constant | O(md log³ m + poly(k, log m))
Algorithm 2, k-Medians | 1 + (7 + 10α − 10α²) / ((1 − α)(1 − 2α)) | [0, 1/2) | Õ((md / (1 − 2α)) log³ m log²(k/δ))
Table 1: Comparison of learning-augmented k-means and k-medians algorithms. We recall that m is the data set size, d is the ambient dimension, α is the label error rate, and δ is the failure probability (where applicable). The success probability of the k-medians algorithm of Ergun et al. [2022] is 1 − poly(1/k). The Õ notation hides some log factors to simplify the expressions.
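As a rough sanity check, the closed-form guarantees in Table 1 can be compared numerically as functions of α. The following short Python sketch is purely illustrative (the function names and the sample values of α are ours; the formulas are transcribed from the table):

# Illustrative only: compare the closed-form approximation factors from Table 1
# as functions of the label error rate alpha (within their stated validity ranges).

def factor_ergun_kmeans(alpha):
    # Ergun et al. [2022], k-means: 1 + 20*alpha
    return 1 + 20 * alpha

def factor_alg1_kmeans(alpha):
    # Algorithm 1, k-means: 1 + (5a - 2a^2) / ((1 - 2a)(1 - a)), for alpha in [0, 1/2)
    return 1 + (5 * alpha - 2 * alpha**2) / ((1 - 2 * alpha) * (1 - alpha))

def factor_alg2_kmedians(alpha):
    # Algorithm 2, k-medians: 1 + (7 + 10a - 10a^2) / ((1 - a)(1 - 2a)), for alpha in [0, 1/2)
    return 1 + (7 + 10 * alpha - 10 * alpha**2) / ((1 - alpha) * (1 - 2 * alpha))

for alpha in (0.01, 0.05, 0.10, 0.14):
    print(f"alpha={alpha:.2f}: "
          f"Ergun k-means {factor_ergun_kmeans(alpha):.3f}, "
          f"Alg. 1 k-means {factor_alg1_kmeans(alpha):.3f}, "
          f"Alg. 2 k-medians {factor_alg2_kmedians(alpha):.3f}")

In particular, for small α the factor of Algorithm 1 grows roughly like 1 + 5α, compared with the 1 + 20α guarantee of the k-means algorithm of Ergun et al. [2022].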
Upper bound on α. Note that if the label error rate α equals 1/2, then even for three clusters there is no longer a clear relationship between the predicted clusters and the corresponding optimal clusters. For instance, given three optimal clusters P∗_1, P∗_2, P∗_3 with equally many points, if for all i ∈ [3] the predicted cluster P_i consists of half the points in P∗_i and half the points in P∗_{(i+1) mod 3}, then the label error rate α = 1/2 is achieved, but there is no clear correspondence between P∗_i and P_i. In other words, it is not clear whether the predicted labels give us any useful information about an optimal clustering. In this sense, α = 1/2 is a natural stopping point for this problem.
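To make this construction concrete, the following small Python sketch (ours, for illustration only; it assumes the label error rate is measured as the fraction of points whose predicted label disagrees with the optimal one, minimized over renamings of the predicted labels) builds three equal-size optimal clusters, shifts half of each cluster's points to the neighboring predicted cluster, and verifies that every matching of predicted to optimal labels mislabels at least half of the points.

from itertools import permutations

n = 100  # points per optimal cluster (assumed even); the value is arbitrary
# Optimal labels: three clusters 0, 1, 2, each with n points.
opt = [c for c in range(3) for _ in range(n)]

# Predicted labels: predicted cluster i receives half of P*_i and half of P*_{(i+1) mod 3},
# so a point of optimal cluster c either keeps label c or is claimed by cluster (c - 1) mod 3.
pred = [c if j < n // 2 else (c - 1) % 3 for c in range(3) for j in range(n)]

# Best achievable agreement over all renamings of the predicted labels.
error_rate = min(
    sum(relabel[p] != o for p, o in zip(pred, opt)) / len(opt)
    for relabel in permutations(range(3))
)
print(error_rate)  # 0.5: even the best matching mislabels half of the points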
1.2 Related Work
This work belongs to a growing literature on learning-augmented algorithms. Machine learning has been
used to improve algorithms for a number of classical problems, including data structures [Kraska et al., 2018,
Mitzenmacher, 2018, Lin et al., 2022], online algorithms [Purohit et al., 2018], graph algorithms [Khalil et al.,
2017, Chen et al., 2022a,b], frequency estimation [Du et al., 2021], caching [Rohatgi, 2020, Wei,
2020], and support estimation [Eden et al., 2021]. We refer the reader to Mitzenmacher and Vassilvitskii
[2020] for an overview and applications of the framework.
Another relevant line of work is clustering with side information. The works Balcan and Blum [2008],
Awasthi et al. [2014], Vikram and Dasgupta [2016] studied an interactive clustering setting where an oracle
interactively provides advice about whether or not to merge two clusters. Basu et al. [2004] proposed an
active learning framework for clustering, where the algorithm has access to a predictor that determines if two
points should or should not belong to the same cluster. Ashtiani et al. [2016] introduced a semi-supervised
active clustering framework where the algorithm has access to a predictor that answers queries about whether
two particular points belong to the same cluster in an optimal clustering. The goal is to produce a (1 + α)-approximate clustering
while minimizing the query complexity to the oracle.
Approximation stability, proposed in Balcan et al. [2013], is another assumption used to circumvent
the NP-hardness of approximation for k-means clustering. More formally, the notion of (c, α)-stability
requires that every c-approximate clustering is α-close to the optimal solution in terms of the fraction of
incorrectly clustered points. This differs from our setting, where at most an α fraction of the points are
incorrectly clustered, and those points can worsen the clustering cost arbitrarily.
Gamlath et al. [2022] studies the problem of k-means clustering in the presence of noisy labels, where
the cluster label of each point is created by either an adversarial or a random perturbation of the optimal
solution. Their Balanced Adversarial Noise Model assumes that the size of the symmetric difference between
the predicted cluster P_i and the optimal cluster P∗_i is bounded by α|P∗_i|. Their algorithm uses a subroutine
with runtime exponential in k and d for a fixed α ≤ 1/4. In this work, we make different assumptions on
the predicted cluster P_i and the optimal cluster P∗_i. Moreover, our focus is on practical, nearly linear-time
algorithms for k-means and k-medians clustering that can scale to very large datasets.