Tractable Optimality in Episodic Latent MABs
Jeongyeol Kwon1, Yonathan Efroni2, Constantine Caramanis3, and Shie Mannor4
1Wisconsin Institute for Discovery, UW-Madison
2Meta, New York
3Department of Electrical and Computer Engineering, University of Texas at Austin
4Department of Electrical Engineering, Technion / NVIDIA
October 10, 2022
Abstract
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps. Depending on the length of the episode, the learner may not be able to estimate the latent context accurately. The resulting partial observation of the environment makes the learning task significantly more challenging. Without any additional structural assumptions, existing techniques to tackle partially observed settings imply that the decision maker can learn a near-optimal policy with $O(A)^{H}$ episodes, but do not promise more. In this work, we show that learning with a number of samples polynomial in $A$ is possible. We achieve this by using techniques from experiment design. Then, through a method-of-moments approach, we design a procedure that provably learns a near-optimal policy with $O(\mathrm{poly}(A) + \mathrm{poly}(M, H)^{\min(M,H)})$ interactions. In practice, we show that we can formulate the moment-matching via maximum likelihood estimation. In our experiments, this significantly outperforms the worst-case guarantees, as well as existing practical methods.
1 Introduction
In Multi-Armed Bandits (MABs), an agent learns to act optimally by interacting with an unknown environment.
In many applications, interaction sessions are short, relative to the complexity of the overall task. As a
motivating example, we consider a large content website that interacts with users arriving to the site, by
making sequential recommendations. In each such episode, the system has only a few chances to recommend
items before the user leaves the website – an interaction time typically much smaller than the number of
actions or the types of users. In such settings, agents often have access to short horizon episodes, and have to
learn how to process observations from different episodes to learn the best adaptation strategy.
Motivated by such examples, we consider the problem of learning a near-optimal policy in Latent Multi-Armed Bandits (LMABs). In each episode with time-horizon $H$, the agent interacts with one of $M$ possible MAB environments (e.g., type of user) randomly chosen by nature. Without knowing the identity of the environment (we call this the latent context), an agent aims to maximize the expected cumulative reward per episode (see Definition 2.1 for a formal description). The LMAB framework is different from the setting considered in [31], where multiple episodes proceed in parallel without limiting the horizon $H$ of an episode. For long horizons $H$, we show that it is possible to find a near-optimal policy and to determine near-optimal actions for each episode, as if we knew the hidden context. If $H \ll A$, where $A$ is the total number of actions, however, this is no longer possible. Instead, we aim to learn the best history-dependent policy for $H$ time steps.
Figure 1: Nature of the problem depends on the length of episodes (H) and number of different contexts (M)
1.1 Our Results and Contributions
In this work, we study the problem of learning a near-optimal policy of an LMAB for both long and short horizons $H$. In the long-horizon case, we show that the problem's latent parameters are learnable. However, for short horizons, the latent parameters may not be identifiable and a different perspective is required.
A naive approach to learning a near-optimal policy of an LMAB starts by casting the problem as a Markov Decision Process (MDP). Assume that the reward values are discrete and defined over a finite alphabet of cardinality $Z$. By defining the state space as the set of all sequences of history observations, an LMAB can be formulated as an MDP with $O((AZ)^{H})$ states. Appealing to standard reinforcement learning algorithms (e.g., [18, 6]), we can learn an $\epsilon$-optimal policy with $O((AZ)^{H}/\epsilon^2)$ samples.
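To make the blow-up concrete, the following minimal Python sketch (illustrative only; it is not part of the paper's method, and the toy values of $A$, $Z$, and $H$ are made up) counts the states of the naive history-MDP, which grows as $O((AZ)^{H})$.

```python
def count_history_states(A: int, Z: int, H: int) -> int:
    """Number of histories of length 0, 1, ..., H-1 over A actions and Z reward values."""
    return sum((A * Z) ** t for t in range(H))

for A, Z, H in [(5, 2, 3), (10, 2, 5), (100, 2, 5)]:
    n = count_history_states(A, Z, H)
    print(f"A={A:3d}, Z={Z}, H={H}: {n:,} history states  ((AZ)^H = {(A * Z) ** H:,})")
```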
A natural way to improve upon the naive approach is to exploit the unique structure of the LMAB setting. In this work we focus on the following question: can we learn a near-optimal policy with fewer than the $O((AZ)^{H}/\epsilon^2)$ samples required by the naive approach?
Our work answers the above question affirmatively. Specifically, our main contributions are as follows. We show that the sample complexity of our algorithm scales as $\mathrm{poly}(A) + \mathrm{poly}(H, M)^{\min(H,M)}$, and is thus tractable when either $H$ or $M$ is small, even for very large $A$. That is, we are particularly focused on the setting with a few contexts or relatively short episodes (in comparison to $A$), i.e., $M = O(1)$ or $H = O(1)$, where a natural objective is to learn a near-optimal history-dependent policy for $H$ time steps (see also Figure 1).
1.2 Related Work
Comparison of Regimes
The nature of learning a near-optimal policy in LMABs changes depending on the length of episodes $H$ and the number of contexts $M$ (Figure 1). For instance, when $H = 1$ or $M = 1$, the problem of learning in an LMAB is essentially equivalent to the classical multi-armed bandit problem that has been extensively studied in the literature (e.g., see [26] and references therein). In another well-studied line of work, Bayesian learning, each episode is assumed to have a structured prior over contexts ($M \to \infty$; e.g., all expected rewards of arms are independently sampled from Beta conjugate priors). Depending on the length of the time-horizon $H$, the problem can either be solved with Thompson Sampling (TS) [38] as $H \to \infty$, or be reduced to learning the prior [21]. When the time-horizon is sufficiently long but finite, e.g., if $H = \Omega(A/\gamma^2)$ for some separation parameter $\gamma$ between a finite number of contexts $M < \infty$, then it is possible to cluster observations from every episode. This setting has been studied in the multitask RL literature, which we describe below. In this work, we are particularly focused on the setting with a few contexts and relatively short episodes (in comparison to $A$), i.e., $M = O(1)$ and $H = O(1)$, where the most natural objective is to learn a near-optimal history-dependent policy.
Learning priors in multi-armed bandit problems
Several recent works have considered the Bayesian learning framework with short time-horizon [29, 35, 21]. The focus in this line of work is on the design of algorithms that learn the prior, while acting with a fixed Bayesian algorithm (e.g., Thompson sampling). While Bayesian learning with short time-horizon may be viewed as a special case of LMABs, the baseline policy we compare ourselves to is the optimal $H$-step policy, which is a harder baseline than a fixed Bayesian algorithm.
Latent MDPs
Some prior work considers the framework of Latent MDPs (LMDPs), which is the MDP generalization of LMABs [16, 37, 24, 23]. In particular, [24] has shown the information-theoretic limit of learning in LMDPs with no assumptions on the MDPs, i.e., an exponential number of sample episodes $\Omega(A^{M})$ is necessary to learn a near-optimal policy in LMDPs. In contrast, we show that in LMABs the required number of episodes can be polynomial in $A$. This does not contradict the result in [24], since their lower-bound construction comes from the challenge of state-exploration with latent contexts. In contrast, there is no state-exploration issue for bandits, which enables the polynomial sample complexity in $A$. Furthermore, our upper bound does not require any assumptions or additional information such as a good initialization or separations, as in [24]. To the best of our knowledge, no existing results are known for learning a near-optimal policy of LMAB instances for $M \geq 3$ without further assumptions.
Multitask RL with explicit clustering
We can cluster observations from each episode if we are given sufficiently long time horizons $H = \tilde{\Omega}(A/\gamma^2)$ in each episode [9, 14, 16]. Here, $\gamma > 0$ is the amount of 'separation' between contexts such that for all $m \neq m'$, $\max_{a \in \mathcal{A}} \|\mu_m(a, \cdot) - \mu_{m'}(a, \cdot)\|_{2} \geq \gamma$, where $\mu_m$ is a mean-reward vector of actions in the $m$-th context. We focus on significantly more general cases where there is no obvious way of clustering observations, e.g., when $H \ll A$ or when $\mu_m$ can be arbitrarily close to some other $\mu_{m'}$ for $m \neq m'$.
If we are given a similar separation condition, we also show that a polynomial sample complexity is achievable as long as $H = \tilde{O}(M^2/\epsilon^2)$ (see also Section C.2). Note that this could still be in the $H \ll A$ regime with a large number of actions $A$.
Learning in POMDPs with full-rank observations
One popular learning approach in partially observable systems is the tensor-decomposition method, which extracts the realization of model parameters from third-order tensors [1, 7, 15]. However, the recovery of model parameters requires specific geometric assumptions on the full-rankness of a certain test-observation matrix. Furthermore, most prior work requires a uniform reachability assumption, i.e., all latent states should be reached with non-negligible probability by any exploration policy for the parameter recovery. Recent results in [19, 30] have shown that the uniform reachability assumption can be dropped with the optimism principle. However, they still require the full-rankness of a test-observation matrix to keep the volume of a confidence set explicitly bounded. Since LMAB instances do not necessarily satisfy the full-rankness assumption, their results do not imply an upper bound for learning LMABs.
Reward-Mixing MDPs
Another closely related work is to learn an LMDP (the MDP extension of LMAB) with common state-transition probabilities, so that only the reward function changes depending on the latent context [23]. The authors in [23] have developed an efficient moment-matching based algorithm to learn a near-optimal policy without any assumptions on separations or system dynamics. However, it can only handle the $M = 2$ case with balanced mixing weights $w_1 = w_2 = 1/2$. It is currently not obvious how to extend their result to the $M \geq 3$ case without incurring $O(A^{M})$ sample complexity in LMABs.
Regime switching bandits
An LMAB may also be seen as a special type of adversarial or non-stationary bandit (e.g., [5, 3, 13, 36]) in which the time steps at which the underlying reward distribution (i.e., the latent context) may change are specified. The standard objective in non-stationary bandits is to find the best stationary policy in hindsight, with an unlimited number of possible contexts. Recently, [41] considered a non-stationary bandit with a finite number of contexts $M = O(1)$ and the objective of finding the optimal history-dependent policy. Their setting and goal subsume our goal of learning the optimal policy in LMABs; however, the results in [41] require linear independence between the reward-probability vectors, and thus their setting essentially falls into the category of tractable POMDPs with full-rank observations.
Miscellaneous
There are other modeling approaches more suited to personalized recommendation systems, where multiple episodes proceed in parallel without limits on the time-horizon [31, 14, 10, 17, 25]. In such problems, the goal is to quickly adapt policies for individual episodes assuming certain similarities between tasks. In contrast, in the episodic LMAB setting, we assume that episodes arrive sequentially, and our goal is to learn an optimal history-dependent policy that maximizes the reward of a single episode with a limited time horizon.
2 Preliminaries
2.1 Problem Setup
We define the problem of episodic latent multi-armed bandits with time-horizon $H \geq 2$ as follows:
Definition 2.1 (Latent Multi-Armed Bandit (LMAB))
An LMAB is a tuple $\mathcal{B} = (\mathcal{A}, \{w_m\}_{m=1}^{M}, \{\mu_m\}_{m=1}^{M})$, where $\mathcal{A}$ is a set of actions, $\{w_m\}_{m=1}^{M}$ are the mixing weights such that a latent context $m$ is randomly chosen with probability $w_m$, and $\mu_m$ is the model parameter that describes a reward distribution, i.e., $P_{\mu_m}(r \mid a) := \mathbb{P}(r \mid m, a)$, for an action $a \in \mathcal{A}$ conditioned on a latent context $m$. (NB: each $\mu_m$ represents a probability model, not necessarily a mean-reward vector.)
We do not assume a priori knowledge of the mixing weights. The bulk of this paper considers discrete reward realizations, i.e., the case in which the support of the reward distribution is finite and bounded. In Appendix B, we also show that our results can be adapted to Gaussian reward distributions.
Assumption 2.2 (Discrete Rewards)
The reward distribution has finite and bounded support. The reward attains a value in the set $\mathcal{Z}$. We assume that for all $z \in \mathcal{Z}$ we have $|z| \leq 1$. We denote the cardinality of $\mathcal{Z}$ as $Z$ and assume that $Z = O(1)$.
As an example, the Bernoulli distribution satisfies Assumption 2.2 with $\mathcal{Z} = \{0, 1\}$ and $Z = 2$. We denote the probability of observing a reward value $z$ by playing an action $a$ in a context $m$ as $\mu_m(a, z) := \mathbb{P}(r = z \mid m, a)$. We often use $\mu_m$ as a reward-probability vector in $\mathbb{R}^{AZ}$, indexed by a tuple $(a, z) \in \mathcal{A} \times \mathcal{Z}$.
At the beginning of every episode, a latent context $m \in [M]$ is sampled from the mixing distribution $\{w_m\}_{m=1}^{M}$ and fixed for $H$ time steps; however, we cannot observe $m$ directly. We consider a policy class $\Pi$ which contains all history-dependent policies $\pi : (\mathcal{A} \times \mathcal{Z})^{*} \to \mathcal{A}$. Our goal is to find a near-optimal policy $\pi \in \Pi$ that maximizes the expected cumulative reward $V$ for each episode, $V^{\star} = \max_{\pi \in \Pi} V(\pi) := \mathbb{E}^{\pi}\big[\sum_{t=1}^{H} r_t\big]$, where the expectation is taken over latent contexts and rewards generated by an LMAB instance, and actions following a policy $\pi$.
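To make the interaction protocol concrete, the following Python sketch (illustrative only; the instance, the helper names, and the example policy are all hypothetical) simulates episodes of a toy LMAB with Bernoulli rewards under a history-dependent policy and estimates its value $V(\pi)$ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy LMAB instance: M = 2 latent contexts, A = 3 actions, Bernoulli rewards.
w = np.array([0.3, 0.7])                 # mixing weights over latent contexts
mu = np.array([[0.9, 0.1, 0.5],          # mu[m, a] = P(r = 1 | context m, action a)
               [0.1, 0.9, 0.5]])

def run_episode(policy, H=5):
    """Sample a latent context once, then roll out H steps of a history-dependent policy."""
    m = rng.choice(len(w), p=w)          # latent context, hidden from the agent
    history, total = [], 0.0
    for _ in range(H):
        a = policy(history)              # the action may depend on the full (a, r) history
        r = float(rng.random() < mu[m, a])
        history.append((a, r))
        total += r
    return total

def greedy_after_probe(history):
    """A hypothetical policy: probe action 0 first, then commit based on the observed reward."""
    if not history:
        return 0
    return 0 if history[0][1] == 1.0 else 1

returns = [run_episode(greedy_after_probe) for _ in range(10_000)]
print("estimated V(pi) =", np.mean(returns))
```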
Definition 2.3 (Approximate Planning Oracle)
A planning oracle receives an LMAB instance $\mathcal{B}$ and returns an $\epsilon$-approximate policy $\pi$ such that $V^{\star} - V(\pi) \leq \epsilon$.
Concretely, the point-based value-iteration (PBVI) algorithm [33] is an $\epsilon$-approximate planning algorithm which runs in time $O\big(HMAZ \cdot (H^2/\epsilon)^{O(M)}\big)$.
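The sketch below is not PBVI; it is a brute-force alternative that is feasible only for tiny instances. It exploits the fact that in an LMAB the posterior belief over the latent context is a sufficient statistic of the history, and computes $V^{\star}$ exactly by recursing over beliefs (the toy numbers are hypothetical).

```python
import numpy as np

# Toy model (hypothetical numbers): M = 2 contexts, A = 3 actions, rewards in {0, 1}.
w = np.array([0.3, 0.7])                   # prior over latent contexts
mu = np.array([[0.9, 0.1, 0.5],
               [0.1, 0.9, 0.5]])           # mu[m, a] = P(r = 1 | m, a)
M, A = mu.shape
H = 5

def value(belief, steps_left):
    """Optimal expected remaining reward starting from a posterior belief over contexts."""
    if steps_left == 0:
        return 0.0
    best = -np.inf
    for a in range(A):
        p1 = float(belief @ mu[:, a])      # P(r = 1 | belief, a)
        q = 0.0
        for r, pr in ((1, p1), (0, 1.0 - p1)):
            if pr <= 1e-12:
                continue
            # Bayes update of the belief given the observation (a, r).
            like = mu[:, a] if r == 1 else 1.0 - mu[:, a]
            post = belief * like
            post = post / post.sum()
            q += pr * (r + value(post, steps_left - 1))
        best = max(best, q)
    return best

print("V* for the toy instance:", value(w, H))
```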
2.2 Experimental Design
We now give a high-level overview of the experimental design techniques used in this work. Suppose we are given a matrix $\Phi \in \mathbb{R}^{d \times k}$ where $d \geq k$. Define a distribution over the rows of $\Phi$, $\rho \in \Delta_d$, an element of the $d$-dimensional simplex. We want to select a small subset of rows of $\Phi$ which minimizes $g(\rho)$, defined below:
\[
G(\rho) = \sum_{i \in [d]} \rho(i)\, \Phi_{i,:} \Phi_{i,:}^{\top}, \qquad g(\rho) = \max_{i \in [d]} \|\Phi_{i,:}\|^{2}_{G(\rho)^{-1}}, \tag{1}
\]
where $\Phi_{i,:} \in \mathbb{R}^{k}$ is the $i$-th row of $\Phi$ and $G(\rho) \in \mathbb{R}^{k \times k}$. To achieve this task we use results from the experimental design literature. The following theorem shows the existence of a $\rho$ that minimizes $g(\rho)$ with a small support over the row indices of $\Phi$:
Theorem 2.4 (Theorem 4.4 in [27])
There exists a probability distribution $\rho$ such that $g(\rho) \leq 2k$ and $|\mathrm{supp}(\rho)| \leq 4k \log\log k + 16$. Furthermore, we can compute such a $\rho$ in time $\tilde{O}(dk^2)$.
As noted in [27], Theorem 2.4 can be obtained from results of Chapter 3 in [39]. Using this fundamental theorem, [27] showed the following proposition:
Proposition 2.5 (Proposition 4.5 in [27])
Let $\rho$ be a distribution over the rows of $\Phi$ that satisfies the condition of Theorem 2.4. Suppose a vector $\mu \in \mathbb{R}^{d}$ can be represented as a sum $\mu = v + \Delta$, where $v \in \mathcal{V}$ (the column span of $\Phi$) and $\|\Delta\|_{\infty} \leq \epsilon_0$. Let $\eta$ be any small noise with $\eta \in [-\epsilon_1, \epsilon_1]^{d}$. Then $\|\Phi\hat{\theta} - \mu\|_{\infty} \leq \epsilon_0 + (\epsilon_0 + \epsilon_1)\sqrt{2k}$, where
\[
\hat{\theta} = G(\rho)^{-1} \sum_{i \in [d]} \rho(i)\, \big(\mu(i) + \eta(i)\big)\, \Phi_{i,:}. \tag{2}
\]
Crucially, we use Proposition 2.5 to reduce the sample complexity in A.
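As a rough illustration of how equations (1) and (2) fit together, the NumPy sketch below builds $G(\rho)$ and the estimator $\hat{\theta}$ for a random feature matrix. For simplicity it plugs in a uniform design $\rho$ rather than the near-optimal design of Theorem 2.4 (computing that design is beyond this sketch), so the $\sqrt{2k}$ factor of Proposition 2.5 is not guaranteed; all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 200, 5

Phi = rng.normal(size=(d, k))                 # feature matrix, one row per index i in [d]
theta_star = rng.normal(size=k)
Delta = 0.01 * rng.uniform(-1, 1, size=d)     # misspecification with ||Delta||_inf <= eps0
mu = Phi @ theta_star + Delta                 # target vector mu = v + Delta
eta = 0.01 * rng.uniform(-1, 1, size=d)       # bounded observation noise

rho = np.full(d, 1.0 / d)                     # stand-in design; Theorem 2.4 would give a
                                              # near-optimal rho supported on O(k log log k) rows

# Equation (1): G(rho) = sum_i rho(i) * Phi_i Phi_i^T
G = (Phi * rho[:, None]).T @ Phi
# Equation (2): theta_hat = G(rho)^{-1} sum_i rho(i) (mu(i) + eta(i)) Phi_i
theta_hat = np.linalg.solve(G, (Phi * (rho * (mu + eta))[:, None]).sum(axis=0))

g_rho = max(Phi[i] @ np.linalg.solve(G, Phi[i]) for i in range(d))
print("g(rho) =", g_rho, " (Theorem 2.4 guarantees a design with g(rho) <= 2k =", 2 * k, ")")
print("||Phi theta_hat - mu||_inf =", np.max(np.abs(Phi @ theta_hat - mu)))
```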
2.3 Wasserstein Distance
We now give a brief overview of the Wasserstein distance and its applications in latent mixture models. The Wasserstein distance is a convenient error metric to measure the parameter distance between two latent models $\{(w_m, \nu_m)\}_{m=1}^{M}$ and $\{(\hat{w}_m, \hat{\nu}_m)\}_{m=1}^{M}$, where $w_m, \hat{w}_m$ are mixing probabilities and $\nu_m, \hat{\nu}_m$ are some parameters for the individual contexts.
The Wasserstein distance is defined as follows. Let $\gamma$ be a finite-support (atomic) distribution over $\{\nu_m\}_{m=1}^{M}$ with probabilities $\{w_m\}_{m=1}^{M}$, i.e., $\gamma = \sum_{m=1}^{M} w_m \delta_{\nu_m}$, where $\delta_v$ is a Dirac-delta distribution with a single mass on $v \in \mathbb{R}^{n}$. Similarly, with parameters $\{(\hat{w}_m, \hat{\nu}_m)\}_{m=1}^{M}$, define an atomic distribution $\hat{\gamma} = \sum_{m} \hat{w}_m \delta_{\hat{\nu}_m}$. We define the Wasserstein distance between $\gamma$ and $\hat{\gamma}$ with respect to the $\ell_\infty$ norm as follows:
\[
W(\gamma, \hat{\gamma}) := \inf_{\Gamma} \, \mathbb{E}_{(m, m') \sim \Gamma}\big[\|\nu_m - \hat{\nu}_{m'}\|_{\infty}\big] = \inf_{\Gamma} \sum_{(m, m') \in [M]^2} \Gamma(m, m') \cdot \|\nu_m - \hat{\nu}_{m'}\|_{\infty},
\]
where the infimum is taken over all couplings $\Gamma$, i.e., joint distributions over $[M] \times [M]$ whose marginals are $\gamma$ and $\hat{\gamma}$ respectively:
\[
\Gamma \in \mathbb{R}_{+}^{M \times M} \ \ \text{such that} \ \ \sum_{m'=1}^{M} \Gamma(m, m') = w_m, \qquad \sum_{m=1}^{M} \Gamma(m, m') = \hat{w}_{m'}. \tag{3}
\]
One nice property of the Wasserstein metric is that the distance measure is invariant to permutations of the individual components, and it remains meaningful even with arbitrarily small mixing probabilities or arbitrarily close parameters for different contexts [40, 12].
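Because $\gamma$ and $\hat{\gamma}$ are atomic with $M$ components each, the infimum over couplings in (3) is a small linear program. The sketch below (illustrative only; the parameter values are made up, and the $\ell_\infty$ ground cost follows the definition above) computes $W(\gamma, \hat{\gamma})$ with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_atomic(w, nu, w_hat, nu_hat):
    """W(gamma, gamma_hat) for atomic distributions, solved as the LP over couplings in (3)."""
    M, Mh = len(w), len(w_hat)
    # Ground cost: ||nu_m - nu_hat_m'||_inf for every pair (m, m').
    cost = np.array([[np.max(np.abs(nu[m] - nu_hat[mp])) for mp in range(Mh)]
                     for m in range(M)])
    # Equality constraints encode the two marginal conditions on Gamma (flattened row-major).
    A_eq = np.zeros((M + Mh, M * Mh))
    for m in range(M):
        A_eq[m, m * Mh:(m + 1) * Mh] = 1.0          # sum_{m'} Gamma(m, m') = w_m
    for mp in range(Mh):
        A_eq[M + mp, mp::Mh] = 1.0                  # sum_{m} Gamma(m, m') = w_hat_{m'}
    b_eq = np.concatenate([w, w_hat])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Hypothetical parameters for two latent models with M = 2 contexts.
w, nu = np.array([0.3, 0.7]), np.array([[0.9, 0.1], [0.1, 0.9]])
w_hat, nu_hat = np.array([0.4, 0.6]), np.array([[0.85, 0.15], [0.2, 0.8]])
print("W(gamma, gamma_hat) =", wasserstein_atomic(w, nu, w_hat, nu_hat))
```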