Learning priors in multi-armed bandit problems
Several recent works have considered the Bayesian learning framework with a short time horizon [29, 35, 21]. The focus in this line of work is on designing algorithms that learn the prior while acting with a fixed Bayesian algorithm (e.g., Thompson sampling). While Bayesian learning with a short time horizon may be viewed as a special case of LMABs, the baseline policy we compare ourselves to is the optimal $H$-step policy, which is a harder baseline than a fixed Bayesian algorithm.
Latent MDPs
Some prior work considers the framework of Latent MDPs (LMDPs), the MDP generalization of LMABs [16, 37, 24, 23]. In particular, [24] established the information-theoretic limit of learning in LMDPs without assumptions on the MDPs: an exponential number of sample episodes, $\Omega(A^M)$, is necessary to learn a near-optimal policy. In contrast, we show that in LMABs the required number of episodes can be polynomial in $A$. This does not contradict the result in [24], since their lower-bound construction stems from the difficulty of state exploration under latent contexts; there is no state-exploration issue for bandits, which enables sample complexity polynomial in $A$. Furthermore, our upper bound does not require any assumptions or additional information such as a good initialization or separation conditions as in [24]. To the best of our knowledge, no existing results are known for learning a near-optimal policy in LMAB instances with $M \ge 3$ without further assumptions.
Multitask RL with explicit clustering
We can cluster observations from each episode if we are given a sufficiently long time horizon $H = \tilde{\Omega}(A/\gamma^2)$ in each episode [9, 14, 16]. Here, $\gamma > 0$ is the amount of `separation' between contexts, i.e., for all $m \neq m'$, $\max_{a \in \mathcal{A}} \|\mu_m(a,\cdot) - \mu_{m'}(a,\cdot)\|_2 \ge \gamma$, where $\mu_m$ is the mean-reward vector of actions in the $m$-th context. We focus on significantly more general cases where there is no obvious way of clustering observations, e.g., when $H \ll A$ or when $\mu_m$ can be arbitrarily close to some other $\mu_{m'}$ for $m \neq m'$. If we are given a similar separation condition, we also show that a polynomial sample complexity is achievable as long as $H = \tilde{O}(M^2/\gamma^2)$ (see also Section C.2). Note that this could still be in the $H \ll A$ regime with a large number of actions $A$.
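To make the role of the separation parameter concrete, the following is a minimal Python sketch (not an algorithm from the cited works) of how episodes could be clustered by their empirical per-arm reward means when $H$ is large enough that every arm is sampled many times per episode; the function name, the use of mean rewards in place of full reward distributions, and the $\gamma/2$ assignment threshold are all illustrative assumptions.

import numpy as np

def cluster_episodes(episode_means, gamma):
    """Greedily group episodes whose empirical mean-reward vectors are within
    gamma/2 of an existing cluster center (sup-norm over arms)."""
    centers, labels = [], []
    for mu_hat in episode_means:              # mu_hat has shape (num_arms,)
        dists = [np.max(np.abs(mu_hat - c)) for c in centers]
        if dists and min(dists) < gamma / 2:  # close to a previously seen context
            labels.append(int(np.argmin(dists)))
        else:                                 # otherwise open a new cluster
            centers.append(mu_hat)
            labels.append(len(centers) - 1)
    return np.array(labels), np.array(centers)

# Toy usage: M = 2 well-separated contexts, A = 2 arms, gamma = 0.4.
rng = np.random.default_rng(0)
true_means = np.array([[0.1, 0.9], [0.9, 0.1]])
contexts = rng.integers(0, 2, size=50)                 # latent context per episode
noise = rng.normal(0.0, 0.05, size=(50, 2))            # per-episode estimation noise
labels, centers = cluster_episodes(true_means[contexts] + noise, gamma=0.4)

When $H \ll A$ or the separation is negligible, the per-episode estimates are too coarse for any such assignment to be reliable, which is exactly the regime we target.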
Learning in POMDPs with full-rank observations
One popular learning approach in partially observable systems is the tensor-decomposition method, which extracts model parameters from third-order tensors [1, 7, 15]. However, recovering the model parameters requires specific geometric assumptions on the full-rankness of a certain test-observation matrix. Furthermore, most prior work requires a uniform reachability assumption, i.e., all latent states must be reached with non-negligible probability by any exploration policy for parameter recovery. Recent results in [19, 30] have shown that the uniform reachability assumption can be dropped via the optimism principle. However, they still require full-rankness of a test-observation matrix to keep the volume of a confidence set explicitly bounded. Since LMAB instances do not necessarily satisfy the full-rankness assumption, their results do not imply an upper bound for learning LMABs.
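As a toy illustration of the full-rankness condition discussed above (the matrices below are made up for this example, not taken from the cited works), one can stack the per-context mean-reward vectors as columns and check the rank; an LMAB instance in which one context is an average of two others already violates the assumption.

import numpy as np

# Columns = latent contexts (M = 3), rows = arms (A = 3); entries are
# hypothetical mean rewards used purely to illustrate the rank condition.
O_full_rank = np.array([[0.9, 0.1, 0.5],
                        [0.1, 0.9, 0.5],
                        [0.5, 0.5, 0.9]])
O_deficient = np.array([[0.9, 0.1, 0.5],
                        [0.1, 0.9, 0.5],
                        [0.5, 0.5, 0.5]])   # third column = average of the first two

print(np.linalg.matrix_rank(O_full_rank))   # 3: full-rankness assumption holds
print(np.linalg.matrix_rank(O_deficient))   # 2: rank-deficient, so the guarantees of
                                            #    the tensor/optimism approaches do not apply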
Reward-Mixing MDPs
Another closely related work learns an LMDP (the MDP extension of LMAB) with common state-transition probabilities, so that only the reward function changes with the latent context [23]. The authors of [23] developed an efficient moment-matching-based algorithm that learns a near-optimal policy without any assumptions on separation or system dynamics. However, it can only handle the $M = 2$ case with balanced mixing weights $w_1 = w_2 = 1/2$. It is currently not obvious how to extend their result to the $M \ge 3$ case without incurring $O(A^M)$ sample complexity in LMABs.
Regime switching bandits
LMABs may also be seen as a special type of adversarial or non-stationary bandits (e.g., [5, 3, 13, 36]) with time steps being specified for when the underlying reward distributions (i.e.,