Anonymous Bandits for Multi-User Systems
Hossein Esfandiari
Google Research
esfandiari@google.com
Vahab Mirrokni
Google Research
mirrokni@google.com
Jon Schneider
Google Research
jschnei@google.com
Abstract
In this work, we present and study a new framework for online learning in systems with multiple users that provide user anonymity. Specifically, we extend the notion of bandits to obey the standard $k$-anonymity constraint by requiring each observation to be an aggregation of rewards for at least $k$ users. This provides a simple yet effective framework where one can learn a clustering of users in an online fashion without observing any user's individual decision. We initiate the study of anonymous bandits and provide the first sublinear regret algorithms and lower bounds for this setting.
1 Introduction
In many modern systems, the system learns the behavior of its users by adaptively interacting with them and adjusting to their responses. For example, in online advertising, a website may show a certain type of ad to a user, observe whether the user clicks on the ad, and adjust the type of ads it serves that user based on this feedback. As part of this learning process, the system faces a dilemma between "exploring" new types of ads, which may or may not interest the user, and "exploiting" its knowledge of what types of ads have historically interested the user. This tradeoff is very well studied in the context of online learning.

In principle, users benefit from a system that automatically adapts to their preferences. However, users may naturally worry about a system that observes all of their actions, fearing that the system may use this personal information against them or mistakenly reveal it to hackers or untrustworthy third parties. An additional dilemma therefore arises between providing a better personalized user experience and earning users' trust.
We study the problem of online learning in multi-user systems under a version of anonymity inspired by $k$-anonymity [Sweeney, 2002]. In its most general form, $k$-anonymity is a property of anonymized data guaranteeing that for every data point, there exist at least $k - 1$ other indistinguishable data points in the dataset. Although there exist more recent notions of privacy and anonymity with stronger guarantees (e.g. various forms of differential privacy), $k$-anonymity is a simple and practical notion of anonymity that remains commonly employed in practice and enforced in various legal settings [Goldsteen et al., 2021, Saf, November 2019, Slijepčević et al., 2021].
To explain our notion of anonymity, consider the online advertising application mentioned earlier. In online advertising, when a user visits a website, the website selects a type of ad and shows that user an ad of that type. The user may then choose to click on that ad or not, and a reward is paid out based on whether the user clicks on the ad (this reward may represent either the utility of the user or the revenue of the online advertising system). Note that the ad here is chosen by the website, and the fact that the website assigns a user a specific type of ad is not something we intend to hide from the website. However, the decision to click on the ad is made by the user. We intend to protect these individual decisions, while allowing the website to learn what general types of ads each user is interested in. In particular, we enforce a form of group-level $C$-anonymity on these decisions, by forcing the system to group users into groups of at least $C$ users and to treat each group equally by assigning all users in the same group the same ad and only observing the total aggregate reward (e.g. the total number of clicks) of these users.
More formally, we study this version of anonymity in a simple model of online learning based on the popular multi-armed bandit setting. In the classic (stochastic) multi-armed bandit problem, there is a learner with some number $K$ of potential actions ("arms"), where each arm is associated with an unknown distribution of rewards. In each round (for $T$ rounds) the learner selects an arm and collects a reward drawn from the distribution corresponding to that arm. The goal of the learner is to maximize their total expected reward (or equivalently, minimize some notion of regret).
In our multi-user model, a centralized learner assigns $N$ users to $K$ arms, and the rewards for each user/arm pair are drawn from some fixed, unknown distribution. Each round the learner proposes an assignment of users to arms, upon which each user receives a reward from the appropriate user/arm distribution. However, the learner is only allowed to record feedback about these rewards (and use this feedback for learning) if they perform this assignment in a manner compatible with $C$-anonymity. This entails partitioning the users into groups of size at least $C$, assigning all users in each group to the same arm, and only observing each group's aggregate reward for its arm. For example, if $C = 3$, in one round we may combine users 1, 2, and 4 into a group and assign them all to arm 3; we would then observe as feedback the aggregate reward $r_{1,3} + r_{2,3} + r_{4,3}$, where $r_{i,j}$ represents the reward that user $i$ experienced from arm $j$ this round. The goal of the learner is to maximize the total reward by efficiently learning the optimal action for each user, while at the same time preserving the anonymity of the individual rewards users experience in specific rounds. See Section 2 for a more detailed formalization of the model.
1.1 Our results
In this paper we provide low-regret algorithms for the anonymous bandit setting described above. We present two algorithms which operate in different regimes (based on how users cluster into their favorite arms):

- If for each arm $j$ there are at least $U \geq C + 1$ users for which arm $j$ is that user's optimal arm, then Algorithm 1 incurs regret at most $\tilde{O}(N\sqrt{C\alpha KT})$, where $\alpha = \max(1, \lceil K(C+1)/U \rceil)$.
- If there is no such guarantee (but $N > C$), then Algorithm 2 incurs regret at most $\tilde{O}(C^{2/3}K^{1/3}T^{2/3})$.
We additionally prove the following corresponding lower bounds:

- We show (Theorem 3) that the regret bound of Algorithm 1 is tight for any algorithm to within a factor of $\sqrt{KC}$ (and to within a factor of $\sqrt{C}$ for $U \geq K(C+1)$).
- We show (Theorem 4) that the dependence on $T^{2/3}$ in Algorithm 2 is necessary in the absence of a lower bound on $U$.
The main technical contribution of this work is the development and analysis of Algorithm 1. The core idea behind Algorithm 1 is to use recent algorithms developed for the problem of batched bandits (where instead of $T$ rounds of adaptivity, the learner is only allowed $B \ll T$ rounds of adaptivity) to reduce this learning problem to a problem in combinatorial optimization related to decomposing a weighted bipartite graph into a collection of degree-constrained bipartite graphs. We then use techniques from combinatorial optimization and convex geometry to come up with efficient approximation algorithms for this combinatorial problem.
1.2 Related work
The bandits problem has been studied for almost a century [Thompson, 1933], and it has been
extensively studied in the standard single user version [Audibert et al., 2009a,b, Audibert and Bubeck,
2010, Auer et al., 2002, Auer and Ortner, 2010, Bubeck et al., 2013, Garivier and Cappé, 2011, Lai and
Robbins, 1985]. There exists recent work on bandits problem for systems with multiple users [Bande
and Veeravalli, 2019a,b, Bande et al., 2021, Vial et al., 2021, Buccapatnam et al., 2015, Chakraborty
et al., 2017, Kolla et al., 2018, Landgren et al., 2016, Sankararaman et al., 2019]. These papers
study this problem from game theoretic and optimization perspectives (e.g. studying coordination /
2
competition between users) and do not consider anonymity. To the best of our knowledge this paper
is the first attempt to formalize and study multi-armed bandits with multiple users from an anonymity
perspective.
Learning how to assign many users to a (relatively) small set of arms can also be thought of as a clustering problem. Clustering of users in multi-user multi-armed bandits has been previously studied. Maillard and Mannor [2014] first studied sequentially clustering users, and were later followed up by other researchers [Nguyen and Lauw, 2014, Gentile et al., 2017, Korda et al., 2016]. Although these works, like ours, attempt to cluster the users, they are allowed to observe each individual's reward and optimize based on it, which contradicts our anonymity requirement.
One technically related line of work that we heavily rely on is recent work on batched bandits. In the
problem of batched bandits, the learner is prevented from iteratively and adaptively making decisions
each round; instead the learning algorithm runs in a small number of “batches”, and in each batch
the learner chooses a set of arms to pull, and observes the outcome at the end of the batch. Batched
multi-armed bandits were initially studied by Perchet et al. [2016] for the particular case of two arms.
Later Gao et al. [2019] studied the problem for multiple arms. Esfandiari et al. [2019] improved
the result of Gao et al. and extended it to linear bandits and adversarial multi-armed bandits. Later
this problem was studied for batched Thompson sampling [Kalkanli and Ozgur, 2021, Karbasi et al.,
2021], Gaussian process bandit optimization [Li and Scarlett, 2021] and contextual bandits [Zhang
et al., 2021, Zanette et al., 2021].
In another related line of work, there have been several successful attempts to apply different notions
of privacy such as differential privacy to multi-armed bandit settings [Tossou and Dimitrakakis, 2016,
Shariff and Sheffet, 2018, Dubey and Pentland, 2020, Basu et al., 2019]. While these papers provide very promising privacy guarantees, they focus on single-user settings. In this work we take advantage of the fact that there are several similar users in the system, and use this to provide guarantees of anonymity. Anonymity and privacy go hand in hand, and in a practical scenario, both lines of work can be combined to provide a higher level of privacy.
Finally, our setting has some similarities to the settings of stochastic linear bandits [Dani et al., 2008, Rusmevichientong and Tsitsiklis, 2010, Abbasi-Yadkori et al., 2011] and stochastic combinatorial bandits [Chen et al., 2013, Kveton et al., 2015]. For example, the superficially similar problem of assigning $N$ users to $K$ arms each round so that each arm has at least $C$ users assigned to it (and where you get to observe the total reward per round) can be solved directly by algorithms for these frameworks. However, although such assignments are $C$-anonymous, there are important subtleties that prevent us from directly applying these techniques in our model. First of all, we do not actually constrain the assignment of users to arms – rather, our notion of anonymity constrains what feedback we can obtain from such an assignment (e.g., it is completely fine for us to assign zero users to an arm, whereas in the above model no actions are possible when $N < CK$). Secondly, we obtain more nuanced feedback than is assumed in these frameworks (specifically, we get to learn the reward of each group of $C$ users, instead of just the total aggregate reward). Nonetheless it is an interesting open question whether any of these techniques can be applied to improve our existing regret bounds (perhaps some form of linear bandits over the anonymity polytopes defined in Section 3.4.2).
2 Model and preliminaries
Notation. We write $[N]$ as shorthand for the set $\{1, 2, \ldots, N\}$. We write $\tilde{O}(\cdot)$ to suppress any poly-logarithmic factors (in $N$, $K$, $C$, or $T$) that arise. We say a random variable $X$ is $\sigma^2$-subgaussian if the mean-normalized variable $Y = X - \mathbb{E}[X]$ satisfies $\mathbb{E}[\exp(sY)] \leq \exp(\sigma^2 s^2/2)$ for all $s \in \mathbb{R}$.

Proofs of most theorems have been postponed to Appendix C of the Supplemental Material in the interest of brevity.
2.1 Anonymous bandits
In the problem of anonymous bandits, there are $N$ users. Each round (for $T$ rounds), our algorithm must assign each user to one of $K$ arms (multiple users can be assigned to the same arm). If user $i$ plays arm $j$, they receive a reward drawn independently from a 1-subgaussian distribution $\mathcal{D}_{i,j}$ with (unknown) mean $\mu_{i,j} \in [0, 1]$. We would like to minimize the overall expected regret of our algorithm. That is, if user $i$ receives reward $r_{i,t}$ in round $t$, we would like to minimize

$$\mathrm{Reg} = T\sum_{i=1}^{N}\max_{j}\mu_{i,j} - \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{N} r_{i,t}\right].$$
Thus far, this simply describes $N$ independent parallel instances of the classic multi-armed bandit problem. We depart from this by imposing an anonymity constraint on how the learner is allowed to observe the users' feedback. This constraint is parameterized by a positive integer $C$ (the minimum group size). In a given round, the learner may partition a subset of the users into groups of size at least $C$, under the constraint that the users within a single group must all be playing the same arm during this round. For each group $G$, the learner then receives as feedback the total reward $\sum_{i \in G} r_{i,t}$ received by users in this group. Note that not all users must belong to a group (the learner simply receives no feedback on such users), and the partition into groups is allowed to change from round to round.
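To make this feedback model concrete, the following is a minimal simulation sketch (our own illustration, not part of the paper; the class name `AnonymousBanditEnv` and its interface are hypothetical). It enforces the two constraints above, namely that every group has size at least $C$ and that all users in a group play the same arm, and it releases only per-group aggregate rewards.

```python
import numpy as np

class AnonymousBanditEnv:
    """Hypothetical simulator of the anonymous-bandit feedback model."""

    def __init__(self, mu, C, rng=None):
        self.mu = np.asarray(mu, dtype=float)  # N x K matrix of (hidden) means
        self.C = C                             # minimum group size
        self.rng = rng or np.random.default_rng()

    def step(self, assignment, groups):
        """Play one round.

        assignment[i] is the arm played by user i; groups is a list of
        lists of user indices. Only per-group aggregate rewards are
        returned; individual rewards are never revealed to the learner.
        """
        N = self.mu.shape[0]
        # Each user's reward: the mean of their (user, arm) pair plus
        # 1-subgaussian (here, standard normal) noise.
        rewards = self.mu[np.arange(N), assignment] + self.rng.normal(size=N)
        feedback = []
        for G in groups:
            assert len(G) >= self.C, "group smaller than C"
            assert len({assignment[i] for i in G}) == 1, "mixed arms in a group"
            feedback.append(float(sum(rewards[i] for i in G)))
        return feedback
```

Users omitted from `groups` still play their assigned arm but generate no feedback, matching the model above.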
Without any constraint on the problem instance, it may be impossible to achieve sublinear regret (see Section 3.6). We therefore additionally impose the following user-cluster assumption on the users: each arm $j$ is the optimal arm for at least $U$ users. Such an assumption generally holds in practice (e.g., in the regime where there are many users but only a few classes of arms). This also prevents situations where, e.g., only a single user likes a given arm but it is hard to learn this without allocating at least $C$ users to this arm and sustaining significant regret. Typically we will take $U > C$; when $U \leq C$ the asymptotic regret bounds for our algorithms may be worse (see Section 3.5).
2.2 Batched stochastic bandits
Our main tool will be algorithms for batched stochastic bandits, as described in Gao et al. [2019]. For our purposes, a batched bandit algorithm is an algorithm for the classical multi-armed bandit problem that proceeds in $B$ stages ("batches"), where the $b$th stage has a predefined length of $D_b$ rounds (with $\sum_b D_b = T$). At the beginning of each stage $b$, the algorithm outputs a non-empty subset of arms $A_b$ (representing the set of arms the algorithm believes might still be optimal). At the end of each stage, the algorithm expects at least $D_b/|A_b|$ independent instances of feedback from arm $j$ for each $j \in A_b$; upon receiving such feedback, the algorithm outputs the subset of arms $A_{b+1}$ to explore in the next batch.
In Gao et al. [2019], the authors design a batched bandit algorithm they call BaSE (batched successive-elimination policy); for completeness, we reproduce a description of their algorithm in Appendix A. When $B = \log\log T$, their algorithm incurs a worst-case expected regret of at most $\tilde{O}(\sqrt{KT})$. In our analysis, we will need the following slightly stronger bound on the behavior of BaSE:
Lemma 1. Set $B = \log\log T$. Let $\mu^* = \max_{j \in [K]} \mu_j$, and for each $j \in [K]$ let $\Delta_j = \mu^* - \mu_j$. Then for each $1 \leq b \leq B$, we have that:

$$D_b \cdot \mathbb{E}\left[\max_{j \in A_b} \Delta_j\right] = \tilde{O}(\sqrt{KT}).$$

In other words, Lemma 1 bounds the expected regret in each batch, even under the assumption that we receive the reward of the worst active arm each round (even if we ask for feedback on a different arm).
It will also be essential in the analysis that follows that the total number of rounds $D_b$ in the $b$th batch only depends on $b$ and is independent of the feedback received thus far (in the language of Gao et al. [2019], the grid used by the batched bandit algorithm is static). This fact will let us run several instances of this batched bandit algorithm in parallel, and guarantees that batches for different instances will always have the same size.
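For intuition, here is a minimal single-user sketch of batched successive elimination in this mold: a static grid of batch lengths fixed in advance, and an active set that shrinks only at batch boundaries. This is our own simplified illustration with a placeholder grid and confidence radius, not the exact BaSE policy of Gao et al. [2019].

```python
import math
import numpy as np

def batched_successive_elimination(pull, K, T, B):
    """Simplified single-user batched successive elimination (sketch).

    `pull(j)` returns one stochastic reward of arm j. Batch lengths form
    a static grid chosen before any feedback arrives, and arms are
    eliminated only between batches.
    """
    # Placeholder static grid: batch b has length roughly T^{(b+1)/B}.
    lengths = [max(1, int(T ** ((b + 1) / B))) for b in range(B)]
    active = list(range(K))
    means = np.zeros(K)
    counts = np.zeros(K)
    for D in lengths:
        pulls_per_arm = max(1, D // len(active))  # split the batch evenly
        for j in active:
            for _ in range(pulls_per_arm):
                r = pull(j)
                counts[j] += 1
                means[j] += (r - means[j]) / counts[j]  # running mean
        # Eliminate clearly suboptimal arms (placeholder confidence radius).
        rad = {j: math.sqrt(2 * math.log(max(T, 2)) / counts[j]) for j in active}
        best = max(active, key=lambda j: means[j])
        active = [j for j in active
                  if means[j] + rad[j] >= means[best] - rad[best]]
    return active  # arms still believed to be possibly optimal
```

With $B = \log\log T$ batches and the grid and radius chosen as in Gao et al. [2019], this style of policy achieves the $\tilde{O}(\sqrt{KT})$ bound quoted above; the sketch here is only meant to illustrate the batched interaction pattern we rely on.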
3 Anonymous Bandits
3.1 Feedback-eliciting sub-algorithm
Our algorithms for anonymous bandits will depend crucially on the following sub-algorithm, which
allows us to take a matching from users to arms and recover (in
O(C)
rounds) an unbiased estimate
4
of each user’s reward (as long as that user is matched to a popular enough arm). More formally, let
π: [N][K]
be a matching from users to arms. We will show how to (in
2C+ 2
rounds) recover
an unbiased estimate of
µi,π(i)
for each
i
such that
|π1(π(i))| ≥ C+ 1
(i.e.
i
is matched to an arm
that at least Cother users are matched to).
The main idea behind this sub-algorithm is simple; for each user, we will get a sample of the total
reward of a group containing the user, and a sample of the total reward of the same group but minus
this user. The difference between these two samples is an unbiased estimate of the user’s reward.
More concretely, we follow these steps:
1. Each round (for $2C + 2$ rounds) the learner will assign user $i$ to arm $\pi(i)$. However, the partition of users into groups will change over the course of these $2C + 2$ rounds.

2. For each arm $j$ such that $|\pi^{-1}(j)| \geq C + 1$, partition the users in $\pi^{-1}(j)$ into groups of size at least $C + 1$ and of size at most $2C + 1$. Let $G_1, G_2, \ldots, G_S$ be the set of groups formed in this way (over all arms $j$). In each group, order the users arbitrarily.

3. In the first round, the learner reports the partition into groups $\{G_1, G_2, \ldots, G_S\}$. For each group $G_s$, let $r_{s,0}$ be the total aggregate reward for group $G_s$ this round.

4. In the $k$th of the next $2C + 1$ rounds, the learner reports the partition into groups $\{G_{1,k}, G_{2,k}, \ldots, G_{S,k}\}$, where $G_{s,k}$ is formed from $G_s$ by removing the $k$th element (if $k > |G_s|$, then we set $G_{s,k} = G_s$). Let $r_{s,k}$ be the total aggregate reward from $G_{s,k}$ reported this round.

5. If user $i$ is the $k$th user in $G_s$, we return the estimate $\hat{\mu}_{i,\pi(i)} = r_{s,0} - r_{s,k}$ of the average reward for user $i$ and arm $\pi(i)$.
Lemma 2. In the above procedure, $\mathbb{E}[\hat{\mu}_{i,\pi(i)}] = \mu_{i,\pi(i)}$, and $\hat{\mu}_{i,\pi(i)}$ is an $O(C)$-subgaussian random variable.
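Below is a minimal sketch of this feedback-eliciting sub-algorithm, written against the hypothetical `AnonymousBanditEnv` simulator sketched in Section 2.1; the grouping in step 2 is done greedily, and all names are ours, for illustration only.

```python
from collections import defaultdict

def elicit_estimates(env, pi, C):
    """Recover unbiased per-user reward estimates in 2C+2 rounds (sketch).

    pi[i] is the arm assigned to user i. Returns a dict mapping user i to
    an estimate of mu[i, pi[i]] for every user whose arm is shared with
    at least C other users, following steps 1-5 above.
    """
    by_arm = defaultdict(list)
    for i, j in enumerate(pi):
        by_arm[j].append(i)
    # Step 2: split each popular arm's users into groups of size C+1..2C+1.
    groups = []
    for users in by_arm.values():
        if len(users) < C + 1:
            continue  # arm too unpopular: no feedback for these users
        while len(users) >= 2 * (C + 1):
            groups.append(users[:C + 1])
            users = users[C + 1:]
        groups.append(users)  # final chunk has size between C+1 and 2C+1
    # Step 3: one round with the full groups.
    r0 = env.step(pi, groups)
    # Step 4: 2C+1 rounds, the k-th of which drops each group's k-th user
    # (groups shorter than k are left intact).
    est = {}
    for k in range(2 * C + 1):
        holed = [G[:k] + G[k + 1:] if k < len(G) else G for G in groups]
        rk = env.step(pi, holed)
        # Step 5: the difference of aggregates is an unbiased estimate,
        # since the two rounds use independent reward draws.
        for s, G in enumerate(groups):
            if k < len(G):
                est[G[k]] = r0[s] - rk[s]
    return est
```

Note that every "holed" group still has size at least $C$ (the full groups have size at least $C + 1$), so each of the $2C + 2$ rounds respects the anonymity constraint.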
3.2 Anonymous decompositions of bipartite graphs
The second ingredient we will need in our algorithm is the notion of an anonymous decomposition of a weighted bipartite graph. Intuitively, by running our batched stochastic bandit algorithms, at the beginning of each batch we will obtain a demand vector for each user (representing the number of times that user would like feedback on each of the $K$ arms). Based on this, we want to generate a collection of assignments (from users to arms) which guarantee that we obtain (while maintaining our anonymity guarantees) the requested amount of information for each user/arm pair.

Formally, we represent a weighted bipartite graph as an $N \times K$ matrix of non-negative entries $w_{i,j}$ (representing the number of instances of feedback user $i$ desires from arm $j$). We will assume that for each $i$, $\sum_j w_{i,j} > 0$ (each user is interested in at least one arm). A $C$-anonymous decomposition of this graph is a collection of $R$ assignments $M_1, M_2, \ldots, M_R$ from users to arms (i.e., functions from $[N]$ to $[K]$) that satisfies the following properties:
1. A user is never assigned to an arm for which they have zero demand. That is, if $M_r(i) = j$, then $w_{i,j} \neq 0$.

2. If $M_r(i) = j$ and $|M_r^{-1}(j)| \geq C + 1$, we say that matching $M_r$ is informative for the user/arm pair $(i, j)$. (Note that this is exactly the condition required for the feedback-eliciting sub-algorithm to output the unbiased estimate $\hat{\mu}_{i,j}$ when run on assignment $M_r$.) For each user/arm pair $(i, j)$, there must be at least $w_{i,j}$ informative assignments.
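As a concrete restatement of these two properties, the following sketch (our own naming; `is_anonymous_decomposition` is hypothetical) checks whether a proposed collection of assignments is a valid $C$-anonymous decomposition of a given demand matrix.

```python
import numpy as np

def is_anonymous_decomposition(w, assignments, C):
    """Check properties 1 and 2 for a candidate decomposition (sketch).

    w: N x K demand matrix; assignments: list of length-N integer arrays,
    where M[i] is the arm user i is assigned to under assignment M.
    """
    w = np.asarray(w, dtype=float)
    N, K = w.shape
    informative = np.zeros((N, K))
    for M in assignments:
        load = np.bincount(M, minlength=K)  # number of users on each arm
        for i in range(N):
            j = M[i]
            if w[i, j] == 0:                # property 1 violated
                return False
            if load[j] >= C + 1:            # M is informative for (i, j)
                informative[i, j] += 1
    # Property 2: every pair (i, j) needs at least w[i, j] informative
    # assignments.
    return bool(np.all(informative >= w))
```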
The weighted bipartite graphs that concern us come from the parallel output of $N$ batched bandit algorithms (described in Section 2.2) and have additional structure. These graphs can be described by a positive total demand $D$ and a non-empty demand set $A_i \subseteq [K]$ for each user $i$ (describing the arms of interest to user $i$). If $j \in A_i$, then $w_{i,j} = D/|A_i|$; otherwise, $w_{i,j} = 0$. To distinguish graphs with the above structure from generic weighted bipartite graphs, we call such graphs batched graphs.

Moreover, for each user $i$, let $j^*(i)$ be the optimal arm for user $i$ (i.e., $j^*(i) = \arg\max_j \mu_{i,j}$). In the algorithm we describe in the next section, with high probability, $j^*(i)$ will always belong to $A_i$. Moreover, a user-cluster assumption of $U$ implies that, for any arm $j$, $|(j^*)^{-1}(j)| \geq U$. This means that in batched graphs that arise in our algorithm, there will exist an assignment where each