Tractable Optimality in Episodic Latent MABs
Jeongyeol Kwon1, Yonathan Efroni2, Constantine Caramanis3, and Shie Mannor4
1Wisconsin Institute for Discovery, UW-Madison
2Meta, New York
3Department of Electrical and Computer Engineering, University of Texas at Austin
4Department of Electrical Engineering, Technion / NVIDIA
October 10, 2022
Abstract
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps. Depending on the length of the episode, the learner may not be able to estimate the latent context accurately. The resulting partial observation of the environment makes the learning task significantly more challenging. Without any additional structural assumptions, existing techniques to tackle partially observed settings imply that the decision maker can learn a near-optimal policy with $O(A)^{H}$ episodes, but do not promise more. In this work, we show that learning with a number of samples polynomial in $A$ is possible. We achieve this by using techniques from experiment design. Then, through a method-of-moments approach, we design a procedure that provably learns a near-optimal policy with $O(\mathrm{poly}(A) + \mathrm{poly}(M, H)^{\min(M,H)})$ interactions. In practice, we show that we can formulate the moment-matching via maximum likelihood estimation. In our experiments, this significantly outperforms the worst-case guarantees, as well as existing practical methods.
1 Introduction
In Multi-Armed Bandits (MABs), an agent learns to act optimally by interacting with an unknown environment.
In many applications, interaction sessions are short, relative to the complexity of the overall task. As a
motivating example, we consider a large content website that interacts with users arriving to the site, by
making sequential recommendations. In each such episode, the system has only a few chances to recommend
items before the user leaves the website – an interaction time typically much smaller than the number of
actions or the types of users. In such settings, agents often have access to short horizon episodes, and have to
learn how to process observations from different episodes to learn the best adaptation strategy.
Motivated by such examples, we consider the problem of learning a near-optimal policy in Latent Multi-Armed Bandits (LMABs). In each episode with time-horizon $H$, the agent interacts with one of $M$ possible MAB environments (e.g., type of user) randomly chosen by nature. Without knowing the identity of the environment (we call this the latent context), an agent aims to maximize the expected cumulative reward per episode (see Definition 2.1 for a formal description). The LMAB framework is different from the setting considered in [31], where multiple episodes proceed in parallel without limiting the horizon $H$ of an episode. For long horizons $H$, we show that it is possible to find a near-optimal policy and to determine near-optimal actions for each episode, as if we knew the hidden context. If $H \ll A$, where $A$ is the total number of actions, however, this is no longer possible. Instead, we aim to learn the best history-dependent policy for $H$ time steps.
Figure 1: Nature of the problem depends on the length of episodes (H) and number of different contexts (M)
1.1 Our Results and Contributions
In this work, we study the problem of learning a near-optimal policy of an LMAB for both long and short horizons $H$. In the long-horizon case, we show that the problem's latent parameters are learnable. However, for short horizons, the latent parameters may not be identifiable and a different perspective is required.
A naive approach to learning a near-optimal policy of an LMAB starts by casting the problem as a Markov Decision Process (MDP). Assume that the reward values are discrete and defined over a finite alphabet of cardinality $Z$. By defining the state space as the set of all sequences of history observations, an LMAB can be formulated as an MDP with $O((AZ)^{H})$ states. Appealing to standard reinforcement learning algorithms (e.g., [18, 6]), we can learn an $\epsilon$-optimal policy with $O((AZ)^{H}/\epsilon^2)$ samples.
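To make the blow-up concrete, the following minimal Python sketch (illustrative only; it is not part of the paper's method, and the toy values of $A$, $Z$, and $H$ are made up) counts the states of the naive history-MDP, which grows as $O((AZ)^{H})$.

```python
def count_history_states(A: int, Z: int, H: int) -> int:
    """Number of histories of length 0, 1, ..., H-1 over A actions and Z reward values."""
    return sum((A * Z) ** t for t in range(H))

for A, Z, H in [(5, 2, 3), (10, 2, 5), (100, 2, 5)]:
    n = count_history_states(A, Z, H)
    print(f"A={A:3d}, Z={Z}, H={H}: {n:,} history states  ((AZ)^H = {(A * Z) ** H:,})")
```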
A natural way to improve upon the naive approach is to exploit the unique structure of the LMAB setting. In this work we focus on the following question: can we learn a near-optimal policy with fewer than the $O((AZ)^{H}/\epsilon^2)$ samples required by the naive approach?
Our work answers the above question affirmatively. Specifically, our main contributions are as follows. We show that the sample complexity of our algorithm scales as $\mathrm{poly}(A) + \mathrm{poly}(H, M)^{\min(H,M)}$, and is thus tractable when either $H$ or $M$ is small, even for very large $A$. That is, we are particularly focused on the setting with a few contexts or relatively short episodes (in comparison to $A$), i.e., $M = O(1)$ or $H = O(1)$, where a natural objective is to learn a near-optimal history-dependent policy for $H$ time steps (see also Figure 1).
1.2 Related Work
Comparison of Regimes
The nature of learning a near-optimal policy in LMABs changes depending on the length of episodes $H$ and the number of contexts $M$ (Figure 1). For instance, when $H = 1$ or $M = 1$, the problem of learning in an LMAB is essentially equivalent to the classical multi-armed bandit problem that has been extensively studied in the literature (e.g., see [26] and references therein). In another well-studied line of work, Bayesian learning, each episode is assumed to have a structured prior over contexts ($M \to \infty$; e.g., all expected rewards of arms are independently sampled from Beta conjugate priors). Depending on the length of the time-horizon $H$, the problem can either be solved with Thompson Sampling (TS) [38] as $H \to \infty$, or be reduced to learning the prior [21]. When the time-horizon is sufficiently long but finite, e.g., if $H = \Omega(A/\gamma^2)$ for some separation parameter $\gamma$ between a finite number of contexts $M < \infty$, then it is possible to cluster observations from every episode. This setting has been studied in the multitask RL literature, which we describe below. In this work, we are particularly focused on the setting with a few contexts and relatively short episodes (in comparison to $A$), i.e., $M = O(1)$ and $H = O(1)$, where the most natural objective is to learn a near-optimal history-dependent policy.
Learning priors in multi-armed bandit problems
Several recent works have considered the Bayesian learning framework with short time-horizon [29, 35, 21]. The focus in this line of work is on the design of algorithms that learn the prior, while acting with a fixed Bayesian algorithm (e.g., Thompson sampling). While Bayesian learning with short time-horizon may be viewed as a special case of LMABs, the baseline policy we compare ourselves to is the optimal $H$-step policy, which is a harder baseline than a fixed Bayesian algorithm.
Latent MDPs
Some prior work considers the framework of Latent MDPs (LMDPs), which is the MDP generalization of LMABs [16, 37, 24, 23]. In particular, [24] has shown the information-theoretic limit of learning in LMDPs with no assumptions on the MDPs, i.e., an exponential number of sample episodes $\Omega(A^{M})$ is necessary to learn a near-optimal policy in LMDPs. In contrast, we show that in LMABs the required number of episodes can be polynomial in $A$. This does not contradict the result in [24], since their lower-bound construction comes from the challenge of state-exploration with latent contexts. In contrast, there is no state-exploration issue for bandits, which enables the polynomial sample complexity in $A$. Furthermore, our upper bound does not require any assumptions or additional information such as a good initialization or separations, as in [24]. To the best of our knowledge, no existing results are known for learning a near-optimal policy of LMAB instances for $M \geq 3$ without further assumptions.
Multitask RL with explicit clustering
We can cluster observations from each episode if we are given sufficiently long time horizons $H = \tilde{\Omega}(A/\gamma^2)$ in each episode [9, 14, 16]. Here, $\gamma > 0$ is the amount of 'separation' between contexts such that for all $m \neq m'$, $\max_{a \in \mathcal{A}} \|\mu_m(a, \cdot) - \mu_{m'}(a, \cdot)\|_{2} \geq \gamma$, where $\mu_m$ is a mean-reward vector of actions in the $m$-th context. We focus on significantly more general cases where there is no obvious way of clustering observations, e.g., when $H \ll A$ or when $\mu_m$ can be arbitrarily close to some other $\mu_{m'}$ for $m \neq m'$.
If we are given a similar separation condition, we also show that a polynomial sample complexity is achievable as long as $H = \tilde{O}(M^2/\epsilon^2)$ (see also Section C.2). Note that this could still be in the $H \ll A$ regime with a large number of actions $A$.
Learning in POMDPs with full-rank observations
One popular learning approach in partially observable systems is the tensor-decomposition method, which extracts the realization of model parameters from third-order tensors [1, 7, 15]. However, the recovery of model parameters requires specific geometric assumptions on the full-rankness of a certain test-observation matrix. Furthermore, most prior work requires a uniform reachability assumption, i.e., all latent states should be reached with non-negligible probability by any exploration policy for the parameter recovery. Recent results in [19, 30] have shown that the uniform reachability assumption can be dropped with the optimism principle. However, they still require the full-rankness of a test-observation matrix to keep the volume of a confidence set explicitly bounded. Since LMAB instances do not necessarily satisfy the full-rankness assumption, their results do not imply an upper bound for learning LMABs.
Reward-Mixing MDPs
Another closely related work is to learn an LMDP (the MDP extension of LMAB) with common state-transition probabilities, so that only the reward function changes depending on the latent context [23]. The authors in [23] have developed an efficient moment-matching based algorithm to learn a near-optimal policy without any assumptions on separations or system dynamics. However, it can only handle the $M = 2$ case with balanced mixing weights $w_1 = w_2 = 1/2$. It is currently not obvious how to extend their result to the $M \geq 3$ case without incurring $O(A^{M})$ sample complexity in LMABs.
Regime switching bandits
An LMAB may also be seen as a special type of adversarial or non-stationary bandit (e.g., [5, 3, 13, 36]) in which the time steps at which the underlying reward distribution (i.e., the latent context) may change are specified. The standard objective in non-stationary bandits is to find the best stationary policy in hindsight, with an unlimited number of possible contexts. Recently, [41] considered a non-stationary bandit with a finite number of contexts $M = O(1)$ and the objective of finding the optimal history-dependent policy. Their setting and goal subsume our goal of learning the optimal policy in LMABs; however, the results in [41] require linear independence between the reward-probability vectors, and thus their setting essentially falls into the category of tractable POMDPs with full-rank observations.
Miscellaneous
There are other modeling approaches more suited to personalized recommendation systems, where multiple episodes proceed in parallel without limits on the time-horizon [31, 14, 10, 17, 25]. In such problems, the goal is to quickly adapt policies for individual episodes assuming certain similarities between tasks. In contrast, in the episodic LMAB setting, we assume that episodes arrive sequentially, and our goal is to learn an optimal history-dependent policy that maximizes the reward of a single episode with a limited time horizon.
2 Preliminaries
2.1 Problem Setup
We define the problem of episodic latent multi-armed bandits with time-horizon $H \geq 2$ as follows:
Definition 2.1 (Latent Multi-Armed Bandit (LMAB))
An LMAB is a tuple $\mathcal{B} = (\mathcal{A}, \{w_m\}_{m=1}^{M}, \{\mu_m\}_{m=1}^{M})$, where $\mathcal{A}$ is a set of actions, $\{w_m\}_{m=1}^{M}$ are the mixing weights such that a latent context $m$ is randomly chosen with probability $w_m$, and $\mu_m$ is the model parameter that describes a reward distribution, i.e., $P_{\mu_m}(r \mid a) := \mathbb{P}(r \mid m, a)$, for an action $a \in \mathcal{A}$ conditioned on a latent context $m$. (NB: each $\mu_m$ represents a probability model, not necessarily a mean-reward vector.)
We do not assume a priori knowledge of the mixing weights. The bulk of this paper considers discrete reward realizations, i.e., the case in which the support of the reward distribution is finite and bounded. In Appendix B, we also show that our results can be adapted to Gaussian reward distributions.
Assumption 2.2 (Discrete Rewards)
The reward distribution has finite and bounded support. The reward attains a value in the set $\mathcal{Z}$. We assume that for all $z \in \mathcal{Z}$ we have $|z| \leq 1$. We denote the cardinality of $\mathcal{Z}$ as $Z$ and assume that $Z = O(1)$.
As an example, the Bernoulli distribution satisfies Assumption 2.2 with $\mathcal{Z} = \{0, 1\}$ and $Z = 2$. We denote the probability of observing a reward value $z$ by playing an action $a$ in a context $m$ as $\mu_m(a, z) := \mathbb{P}(r = z \mid m, a)$. We often use $\mu_m$ as a reward-probability vector in $\mathbb{R}^{AZ}$, indexed by a tuple $(a, z) \in \mathcal{A} \times \mathcal{Z}$.
At the beginning of every episode, a latent context $m \in [M]$ is sampled from the mixing distribution $\{w_m\}_{m=1}^{M}$ and fixed for $H$ time steps; however, we cannot observe $m$ directly. We consider a policy class $\Pi$ which contains all history-dependent policies $\pi : (\mathcal{A} \times \mathcal{Z})^{*} \to \mathcal{A}$. Our goal is to find a near-optimal policy $\pi \in \Pi$ that maximizes the expected cumulative reward $V$ for each episode, $V^{\star} = \max_{\pi \in \Pi} V(\pi) := \mathbb{E}^{\pi}\big[\sum_{t=1}^{H} r_t\big]$, where the expectation is taken over latent contexts and rewards generated by an LMAB instance, and actions following a policy $\pi$.
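To make the interaction protocol concrete, the following Python sketch (illustrative only; the instance, the helper names, and the example policy are all hypothetical) simulates episodes of a toy LMAB with Bernoulli rewards under a history-dependent policy and estimates its value $V(\pi)$ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy LMAB instance: M = 2 latent contexts, A = 3 actions, Bernoulli rewards.
w = np.array([0.3, 0.7])                 # mixing weights over latent contexts
mu = np.array([[0.9, 0.1, 0.5],          # mu[m, a] = P(r = 1 | context m, action a)
               [0.1, 0.9, 0.5]])

def run_episode(policy, H=5):
    """Sample a latent context once, then roll out H steps of a history-dependent policy."""
    m = rng.choice(len(w), p=w)          # latent context, hidden from the agent
    history, total = [], 0.0
    for _ in range(H):
        a = policy(history)              # the action may depend on the full (a, r) history
        r = float(rng.random() < mu[m, a])
        history.append((a, r))
        total += r
    return total

def greedy_after_probe(history):
    """A hypothetical policy: probe action 0 first, then commit based on the observed reward."""
    if not history:
        return 0
    return 0 if history[0][1] == 1.0 else 1

returns = [run_episode(greedy_after_probe) for _ in range(10_000)]
print("estimated V(pi) =", np.mean(returns))
```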
Definition 2.3 (Approximate Planning Oracle)
A planning oracle receives an LMAB instance $\mathcal{B}$ and returns an $\epsilon$-approximate policy $\pi$ such that $V^{\star} - V(\pi) \leq \epsilon$.
Concretely, the point-based value-iteration (PBVI) algorithm [33] is an $\epsilon$-approximate planning algorithm which runs in time $O\big(HMAZ \cdot (H^2/\epsilon)^{O(M)}\big)$.
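The sketch below is not PBVI; it is a brute-force alternative that is feasible only for tiny instances. It exploits the fact that in an LMAB the posterior belief over the latent context is a sufficient statistic of the history, and computes $V^{\star}$ exactly by recursing over beliefs (the toy numbers are hypothetical).

```python
import numpy as np

# Toy model (hypothetical numbers): M = 2 contexts, A = 3 actions, rewards in {0, 1}.
w = np.array([0.3, 0.7])                   # prior over latent contexts
mu = np.array([[0.9, 0.1, 0.5],
               [0.1, 0.9, 0.5]])           # mu[m, a] = P(r = 1 | m, a)
M, A = mu.shape
H = 5

def value(belief, steps_left):
    """Optimal expected remaining reward starting from a posterior belief over contexts."""
    if steps_left == 0:
        return 0.0
    best = -np.inf
    for a in range(A):
        p1 = float(belief @ mu[:, a])      # P(r = 1 | belief, a)
        q = 0.0
        for r, pr in ((1, p1), (0, 1.0 - p1)):
            if pr <= 1e-12:
                continue
            # Bayes update of the belief given the observation (a, r).
            like = mu[:, a] if r == 1 else 1.0 - mu[:, a]
            post = belief * like
            post = post / post.sum()
            q += pr * (r + value(post, steps_left - 1))
        best = max(best, q)
    return best

print("V* for the toy instance:", value(w, H))
```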
2.2 Experimental Design
We now give a high-level overview of the experimental design techniques used in this work. Suppose we are given a matrix $\Phi \in \mathbb{R}^{d \times k}$ where $d \geq k$. Define a distribution over the rows of $\Phi$, $\rho \in \Delta_d$, an element of the $d$-dimensional simplex. We want to select a small subset of rows of $\Phi$ which minimizes $g(\rho)$, defined below:
\[
G(\rho) = \sum_{i \in [d]} \rho(i)\, \Phi_{i,:} \Phi_{i,:}^{\top}, \qquad g(\rho) = \max_{i \in [d]} \|\Phi_{i,:}\|^{2}_{G(\rho)^{-1}}, \tag{1}
\]
where $\Phi_{i,:} \in \mathbb{R}^{k}$ is the $i$-th row of $\Phi$ and $G(\rho) \in \mathbb{R}^{k \times k}$. To achieve this task we use results from the experimental design literature. The following theorem shows the existence of a $\rho$ that minimizes $g(\rho)$ with a small support over the row indices of $\Phi$:
Theorem 2.4 (Theorem 4.4 in [27])
There exists a probability distribution $\rho$ such that $g(\rho) \leq 2k$ and $|\mathrm{supp}(\rho)| \leq 4k \log\log k + 16$. Furthermore, we can compute such a $\rho$ in time $\tilde{O}(dk^2)$.
As noted in [27], Theorem 2.4 can be obtained from results of Chapter 3 in [39]. Using this fundamental theorem, [27] showed the following proposition:
Proposition 2.5 (Proposition 4.5 in [27])
Let $\rho$ be a distribution over the rows of $\Phi$ that satisfies the condition of Theorem 2.4. Suppose a vector $\mu \in \mathbb{R}^{d}$ can be represented as a sum $\mu = v + \Delta$, where $v \in \mathcal{V}$ (the column span of $\Phi$) and $\|\Delta\|_{\infty} \leq \epsilon_0$. Let $\eta$ be any small noise with $\eta \in [-\epsilon_1, \epsilon_1]^{d}$. Then $\|\Phi\hat{\theta} - \mu\|_{\infty} \leq \epsilon_0 + (\epsilon_0 + \epsilon_1)\sqrt{2k}$, where
\[
\hat{\theta} = G(\rho)^{-1} \sum_{i \in [d]} \rho(i)\, \big(\mu(i) + \eta(i)\big)\, \Phi_{i,:}. \tag{2}
\]
Crucially, we use Proposition 2.5 to reduce the sample complexity in A.
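As a rough illustration of how equations (1) and (2) fit together, the NumPy sketch below builds $G(\rho)$ and the estimator $\hat{\theta}$ for a random feature matrix. For simplicity it plugs in a uniform design $\rho$ rather than the near-optimal design of Theorem 2.4 (computing that design is beyond this sketch), so the $\sqrt{2k}$ factor of Proposition 2.5 is not guaranteed; all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 200, 5

Phi = rng.normal(size=(d, k))                 # feature matrix, one row per index i in [d]
theta_star = rng.normal(size=k)
Delta = 0.01 * rng.uniform(-1, 1, size=d)     # misspecification with ||Delta||_inf <= eps0
mu = Phi @ theta_star + Delta                 # target vector mu = v + Delta
eta = 0.01 * rng.uniform(-1, 1, size=d)       # bounded observation noise

rho = np.full(d, 1.0 / d)                     # stand-in design; Theorem 2.4 would give a
                                              # near-optimal rho supported on O(k log log k) rows

# Equation (1): G(rho) = sum_i rho(i) * Phi_i Phi_i^T
G = (Phi * rho[:, None]).T @ Phi
# Equation (2): theta_hat = G(rho)^{-1} sum_i rho(i) (mu(i) + eta(i)) Phi_i
theta_hat = np.linalg.solve(G, (Phi * (rho * (mu + eta))[:, None]).sum(axis=0))

g_rho = max(Phi[i] @ np.linalg.solve(G, Phi[i]) for i in range(d))
print("g(rho) =", g_rho, " (Theorem 2.4 guarantees a design with g(rho) <= 2k =", 2 * k, ")")
print("||Phi theta_hat - mu||_inf =", np.max(np.abs(Phi @ theta_hat - mu)))
```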
2.3 Wasserstein Distance
We now give a brief overview of the Wasserstein distance and its applications in latent mixture models. The Wasserstein distance is a convenient error metric to measure the parameter distance between two latent models $\{(w_m, \nu_m)\}_{m=1}^{M}$ and $\{(\hat{w}_m, \hat{\nu}_m)\}_{m=1}^{M}$, where $w_m, \hat{w}_m$ are mixing probabilities and $\nu_m, \hat{\nu}_m$ are some parameters for the individual contexts.
The Wasserstein distance is defined as follows. Let $\gamma$ be a finite-support (atomic) distribution over $\{\nu_m\}_{m=1}^{M}$ with probabilities $\{w_m\}_{m=1}^{M}$, i.e., $\gamma = \sum_{m=1}^{M} w_m \delta_{\nu_m}$, where $\delta_v$ is a Dirac-delta distribution with a single mass on $v \in \mathbb{R}^{n}$. Similarly, with parameters $\{(\hat{w}_m, \hat{\nu}_m)\}_{m=1}^{M}$, define an atomic distribution $\hat{\gamma} = \sum_{m} \hat{w}_m \delta_{\hat{\nu}_m}$. We define the Wasserstein distance between $\gamma$ and $\hat{\gamma}$ with respect to the $\ell_\infty$ norm as follows:
\[
W(\gamma, \hat{\gamma}) := \inf_{\Gamma} \, \mathbb{E}_{(m, m') \sim \Gamma}\big[\|\nu_m - \hat{\nu}_{m'}\|_{\infty}\big] = \inf_{\Gamma} \sum_{(m, m') \in [M]^2} \Gamma(m, m') \cdot \|\nu_m - \hat{\nu}_{m'}\|_{\infty},
\]
where the infimum is taken over all couplings $\Gamma$, i.e., joint distributions over $[M] \times [M]$ whose marginals are $\gamma$ and $\hat{\gamma}$ respectively:
\[
\Gamma \in \mathbb{R}_{+}^{M \times M} \ \ \text{such that} \ \ \sum_{m'=1}^{M} \Gamma(m, m') = w_m, \qquad \sum_{m=1}^{M} \Gamma(m, m') = \hat{w}_{m'}. \tag{3}
\]
One nice property of the Wasserstein metric is that the distance measure is invariant to permutations of the individual components, and it remains meaningful even with arbitrarily small mixing probabilities or arbitrarily close parameters for different contexts [40, 12].
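Because $\gamma$ and $\hat{\gamma}$ are atomic with $M$ components each, the infimum over couplings in (3) is a small linear program. The sketch below (illustrative only; the parameter values are made up, and the $\ell_\infty$ ground cost follows the definition above) computes $W(\gamma, \hat{\gamma})$ with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_atomic(w, nu, w_hat, nu_hat):
    """W(gamma, gamma_hat) for atomic distributions, solved as the LP over couplings in (3)."""
    M, Mh = len(w), len(w_hat)
    # Ground cost: ||nu_m - nu_hat_m'||_inf for every pair (m, m').
    cost = np.array([[np.max(np.abs(nu[m] - nu_hat[mp])) for mp in range(Mh)]
                     for m in range(M)])
    # Equality constraints encode the two marginal conditions on Gamma (flattened row-major).
    A_eq = np.zeros((M + Mh, M * Mh))
    for m in range(M):
        A_eq[m, m * Mh:(m + 1) * Mh] = 1.0          # sum_{m'} Gamma(m, m') = w_m
    for mp in range(Mh):
        A_eq[M + mp, mp::Mh] = 1.0                  # sum_{m} Gamma(m, m') = w_hat_{m'}
    b_eq = np.concatenate([w, w_hat])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Hypothetical parameters for two latent models with M = 2 contexts.
w, nu = np.array([0.3, 0.7]), np.array([[0.9, 0.1], [0.1, 0.9]])
w_hat, nu_hat = np.array([0.4, 0.6]), np.array([[0.85, 0.15], [0.2, 0.8]])
print("W(gamma, gamma_hat) =", wasserstein_atomic(w, nu, w_hat, nu_hat))
```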