competition between users) and do not consider anonymity. To the best of our knowledge, this paper
is the first attempt to formalize and study multi-armed bandits with multiple users from an anonymity
perspective.
Learning how to assign many users to a (relatively) small set of arms can also be thought of as a
clustering problem. Clustering of users in multi-user multi-armed bandits has been studied previously.
Maillard and Mannor [2014] first studied sequentially clustering users, and this was later followed up
by other researchers [Nguyen and Lauw, 2014, Gentile et al., 2017, Korda et al., 2016]. Although
these works, like ours, attempt to cluster the users, they are allowed to observe each individual's
reward and optimize based on it, which violates our anonymity requirement.
One technically related line of work on which we rely heavily is the recent work on batched bandits. In
the problem of batched bandits, the learner is prevented from making decisions iteratively and adaptively
each round; instead, the learning algorithm runs in a small number of "batches": in each batch
the learner chooses a set of arms to pull, and only observes the outcomes at the end of the batch. Batched
multi-armed bandits were initially studied by Perchet et al. [2016] for the particular case of two arms.
Later, Gao et al. [2019] studied the problem for multiple arms, and Esfandiari et al. [2019] improved
the result of Gao et al. and extended it to linear bandits and adversarial multi-armed bandits. The
problem has since been studied for batched Thompson sampling [Kalkanli and Ozgur, 2021, Karbasi et al.,
2021], Gaussian process bandit optimization [Li and Scarlett, 2021], and contextual bandits [Zhang
et al., 2021, Zanette et al., 2021].
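To make the batched interaction pattern concrete, here is a minimal illustrative loop in Python. This is our own simplified sketch, not pseudocode from any of the cited papers: the round-robin schedule, the elimination rule, and the threshold of 0.5 are arbitrary choices for illustration. The key constraint it captures is that the schedule of pulls within a batch is committed to before any of that batch's feedback is seen.

```python
import random

def batched_bandit(means, batch_sizes, seed=0):
    """Toy batched bandit: within each batch the pull schedule is fixed in
    advance, and rewards are only observed (and estimates updated) once the
    whole batch has finished."""
    rng = random.Random(seed)
    K = len(means)
    totals = [0.0] * K
    counts = [0] * K
    active = list(range(K))  # arms not yet eliminated
    for size in batch_sizes:
        # Commit to the full schedule before seeing any feedback.
        schedule = [active[t % len(active)] for t in range(size)]
        # All rewards for the batch arrive together at its end.
        for arm in schedule:
            totals[arm] += rng.gauss(means[arm], 1.0)
            counts[arm] += 1
        # Crude elimination between batches (threshold chosen arbitrarily).
        est = {a: totals[a] / counts[a] for a in active}
        best = max(est.values())
        active = [a for a in active if est[a] >= best - 0.5]
    return max(active, key=lambda a: totals[a] / counts[a])
```

With a large reward gap and enough pulls per batch, the clearly suboptimal arms are eliminated after the first batch, so adaptivity is only exercised a small number of times, once per batch, rather than every round.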
In another related line of work, there have been several successful attempts to apply notions of
privacy, such as differential privacy, to multi-armed bandit settings [Tossou and Dimitrakakis, 2016,
Shariff and Sheffet, 2018, Dubey and Pentland, 2020, Basu et al., 2019]. While these papers provide
very promising privacy guarantees, they focus on single-user settings. In this work we
take advantage of the fact that there are several similar users in the system, and use this to provide
guarantees of anonymity. Anonymity and privacy go hand in hand, and in a practical scenario both
lines of work can be combined to provide a higher level of privacy.
Finally, our setting has some similarities to the settings of stochastic linear bandits [Dani et al., 2008,
Rusmevichientong and Tsitsiklis, 2010, Abbasi-Yadkori et al., 2011] and stochastic combinatorial
bandits [Chen et al., 2013, Kveton et al., 2015]. For example, the superficially similar problem of
assigning $N$ users to $K$ arms each round so that each arm has at least $C$ users assigned to it (and
where you get to observe the total reward per round) can be solved directly by algorithms for these
frameworks. However, although such assignments are $C$-anonymous, there are important subtleties
that prevent us from directly applying these techniques in our model. First, we do not actually
constrain the assignment of users to arms; rather, our notion of anonymity constrains what feedback
we can obtain from such an assignment (e.g., it is completely fine for us to assign zero users to an
arm, whereas in the above model no actions are possible when $N < CK$). Second, we obtain more
nuanced feedback than is assumed in these frameworks (specifically, we get to learn the reward of
each group of $\geq C$ users, instead of just the total aggregate reward). Nonetheless, it is an interesting
open question whether any of these techniques can be applied to improve our existing regret bounds (perhaps
some form of linear bandits over the anonymity polytopes defined in Section 3.4.2).
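To make the feedback distinction concrete, the following is a minimal sketch of $C$-anonymous feedback as just described. The function name and the choice to report group totals are our own illustration, not an interface defined in this paper or the cited frameworks: the point is that any assignment is allowed, but only groups of at least $C$ users reveal anything.

```python
def anonymous_feedback(assignment, rewards, C):
    """Given each user's arm (assignment[i]) and realized reward (rewards[i]),
    reveal only the aggregate reward of each group with at least C users.
    Smaller groups reveal nothing, but the assignment itself is never
    forbidden (an arm may even receive zero users)."""
    by_arm = {}
    for user, arm in enumerate(assignment):
        by_arm.setdefault(arm, []).append(rewards[user])
    # Only groups of size >= C produce observable feedback.
    return {arm: sum(rs) for arm, rs in by_arm.items() if len(rs) >= C}
```

For example, `anonymous_feedback([0, 0, 1, 0], [1.0, 2.0, 5.0, 3.0], 2)` returns `{0: 6.0}`: arm 0's group of three users reveals its total reward, while arm 1's lone user stays hidden.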
2 Model and preliminaries
Notation. We write $[N]$ as shorthand for the set $\{1, 2, \ldots, N\}$. We write $\tilde{O}(\cdot)$ to suppress any
poly-logarithmic factors (in $N$, $K$, $C$, or $T$) that arise. We say a random variable $X$ is $\sigma^2$-subgaussian
if the mean-normalized variable $Y = X - \mathbb{E}[X]$ satisfies $\mathbb{E}[\exp(sY)] \leq \exp(\sigma^2 s^2 / 2)$ for all $s \in \mathbb{R}$.
In the interest of brevity, proofs of most theorems are postponed to Appendix C of the Supplemental
Material.
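As a quick numerical illustration of the subgaussian definition (our own sanity check, not from the paper): a Rademacher variable (equal to $\pm 1$ with probability $1/2$ each) is $1$-subgaussian, since its moment generating function $\cosh(s)$ is dominated term by term in its Taylor expansion by $\exp(s^2/2)$.

```python
import math

def rademacher_mgf(s):
    # E[exp(sY)] for Y uniform on {-1, +1} is cosh(s).
    return 0.5 * (math.exp(s) + math.exp(-s))

def subgaussian_bound(s, sigma2=1.0):
    # The bound exp(sigma^2 * s^2 / 2) from the definition above.
    return math.exp(sigma2 * s * s / 2)

# cosh(s) <= exp(s^2 / 2) for all s, so a Rademacher variable is 1-subgaussian.
for s in [-3.0, -1.0, -0.1, 0.0, 0.5, 2.0]:
    assert rademacher_mgf(s) <= subgaussian_bound(s)
```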
2.1 Anonymous bandits
In the problem of anonymous bandits, there are $N$ users. Each round (for $T$ rounds), our algorithm
must assign each user to one of $K$ arms (multiple users can be assigned to the same arm). If user $i$
plays arm $j$, they receive a reward drawn independently from a 1-subgaussian distribution $\mathcal{D}_{i,j}$
with (unknown) mean $\mu_{i,j} \in [0, 1]$. We would like to minimize the overall expected regret of our
algorithm. That is, if user $i$ receives reward $r_{i,t}$ in round $t$, we would like to minimize