Lifelong Bandit Optimization: No Prior and No Regret
Felix Schur*¹, Parnian Kassraie*¹, Jonas Rothfuss¹, Andreas Krause¹
¹ETH Zurich, Switzerland
*Equal contribution.
Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023).
arXiv:2210.15513v3 [stat.ML] 20 Jun 2023
Abstract
Machine learning algorithms are often applied to problems with similar structure, over and over again. We focus on solving a sequence
of bandit optimization tasks and develop LIBO,
an algorithm which adapts to the environment
by learning from past experience and becomes
more sample-efficient in the process. We assume
a kernelized structure where the kernel is unknown
but shared across all tasks. LIBO sequentially
meta-learns a kernel that approximates the true
kernel and solves the incoming tasks with the
latest kernel estimate. Our algorithm can be paired
with any kernelized or linear bandit algorithm and
guarantees oracle optimal performance, meaning
that as more tasks are solved, the regret of LIBO
on each task converges to the regret of the bandit
algorithm with oracle knowledge of the true
kernel. Naturally, if paired with a sublinear bandit
algorithm, LIBO yields a sublinear lifelong regret.
We also show that direct access to the data from
each task is not necessary for attaining sublinear
regret. We propose F-LIBO, which solves the
lifelong problem in a federated manner.
1 INTRODUCTION
A key aspect of human intelligence is our ability to harness
previous experience and quickly improve when repeatedly
solving similar problems. In this paper, we study how to
solve a sequence of learning problems, on related instances,
and become more efficient in the process. In particular we
focus on problems which are solved through Bayesian Opti-
mization, a.k.a. kernelized bandit algorithms (BO), where
the kernel captures the regularity structure of the tasks. A motivating application is AutoML systems, which perform hyper-parameter tuning for the same model on different datasets, or for different models on the same dataset. We expect
that the more tasks our machine learning system solves, the
better the system becomes at solving the next one.
We model this as lifelong learning, where an agent sequentially faces kernelized bandit problems with different unknown reward functions. While prior work assumes the kernel to be known (e.g., hand-designed), we consider the kernel $k$ to be unknown, but shared between the problem instances. After each bandit task, we use the previously collected data to meta-learn a kernel function $\hat{k}$ as a proxy for the unknown $k$. We transfer knowledge across tasks by sequentially updating the meta-learned kernel and using it to solve the next task. This way, we adapt to the environment and gradually improve the bandit performance. Ideally, we would like to reach the oracle-optimal performance, i.e., the performance of a bandit algorithm with complete knowledge of the environment.
Lifelong bandit optimization is a delicate problem for two reasons. First, the success of each round of BO depends on the validity of the meta-learned kernel: we only have guaranteed convergence and sublinear regret if the reproducing kernel Hilbert space (RKHS) induced by the estimated kernel $\hat{k}$ contains the reward functions. Second, the data that is used for meta-learning is collected during the previous BO tasks. Thus, during each BO round, we not only have to quickly find reward-maximizing actions, but also have to collect exploratory data that is sufficiently informative for successful meta-learning of the kernel.
We address these challenges when the true kernel is a sparse
convex combination of a large number of candidate kernels.
We propose an approach for meta-learning a provably
consistent estimator of the true kernel, given data from
previous tasks (Theorem 3.3). To ensure that this data is
sufficiently informative, we interlace the queries of the
BO agent with purely exploratory queries. Combining
these two key ideas, we design our main algorithm, the
Lifelong Bandit Optimizer (LIBO). This algorithm is
versatile since it is agnostic to the bandit policy, i.e. it
can be wrapped around any kernelized or linear base
bandit algorithm to influence its policy and satisfy lifelong
guarantees. We prove that it is oracle-optimal, i.e. that by
using LIBO, we can eventually achieve the same worst-case
performance as the base bandit algorithm which has oracle
knowledge of the true kernel (Theorem 4.1). We do not
make assumptions about the base bandit algorithm, and our
convergence guarantees hold for many bandit solvers such
as OFUL [Abbasi-Yadkori et al., 2011], GP-UCB [Srinivas et al., 2010], or GP-TS [Chowdhury and Gopalan, 2017].
Additionally, we consider a federated setting where each
BO task is performed by a client node in a network and the
data ought not to be exchanged with the server node due to
privacy concerns. We propose the Federated Lifelong Bandit
Optimizer (F-LIBO), and show that it satisfies a guarantee
similar to LIBO (Theorem 5.1). If we take GP-UCB as
base bandit solver, LIBO and F-LIBO have the same
worst-case regret bound rates as the GP-UCB solver when
given oracle knowledge of the true kernel (Corollary 4.2
and 5.2). In Section 6 we support our theoretical findings by
experiments on synthetic and real-world data in the AutoML
context. Lastly, we discuss related works in Section 7.
2 PROBLEM STATEMENT
We consider a lifelong optimization setting, where an agent
interacts with a sequence of black-box optimization prob-
lems, arriving one after another. Throughout the sequence of
optimization tasks, the agent can adapt to the environment
based on the previously collected data and improve its
performance on the succeeding tasks. Formally, the agent
iteratively faces bandit problems with unknown reward functions $f_1, \dots, f_m$ residing in an RKHS $\mathcal{H}_k$ that corresponds to an unknown kernel function $k$. To impose regularity, we assume that the reward functions have bounded kernel norm $\|f_s\|_k \le B$ and that the domain $\mathcal{X} \subset \mathbb{R}^{d_0}$ is compact. The agent interacts with each task $f_s$ for $n$ time steps. For each task $s = 1, \dots, m$, at time step $i = 1, \dots, n$, the agent selects an action $x_{s,i} \in \mathcal{X}$ and receives a stochastic reward via $y_{s,i} = f_s(x_{s,i}) + \varepsilon_{s,i}$. Here, the $\varepsilon_{s,i}$ are i.i.d. samples from a zero-mean sub-Gaussian noise distribution with variance proxy $\sigma^2$.
The goal of the agent is to maximize its rewards across all tasks. This can be formalized as minimizing the lifelong regret over $m$ tasks of size $n$, defined as
$$R(m, n) := \sum_{s=1}^{m} \sum_{i=1}^{n} f_s(x^*_s) - f_s(x_{s,i}),$$
where $x^*_s$ is a global maximum of $f_s$. If $R(m, n)/(mn) \to 0$ as $m, n \to \infty$, then the agent eventually converges to the global optimum of each upcoming optimization task. This property is commonly referred to as sublinearity of the regret. To attain a small regret, the agent maintains an estimate of the unknown reward function $f_s$ based on its history. Typically, a kernelized regression oracle (e.g., kernel ridge regression or Gaussian processes) is employed for this task. The choice of the kernel function plays a key role in the success and data-efficiency of the bandit optimization. If the hypothesis space $\mathcal{H}_k$ induced by the kernel $k$ is too restrictive and does not contain the true reward functions $f_s$, the agent will likely never find reward-maximizing actions. To prevent this, most practitioners pick a conservatively complex kernel with a large hypothesis space that is very likely to contain $\mathcal{H}_k$. However, the larger $\mathcal{H}_k$, the more observations it takes to form a good reward estimate, making it less efficient to find an optimal solution.
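To make the interaction protocol and the lifelong regret concrete, here is a minimal Python sketch of the loop described above. The one-dimensional domain, the noise level, the placeholder random policy, and all names are illustrative assumptions rather than anything from the paper; a real bandit agent would replace RandomPolicy.

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomPolicy:
    """Placeholder agent: picks uniformly random actions in [0, 1]."""
    def reset(self):
        pass
    def select(self):
        return rng.uniform(0.0, 1.0)
    def update(self, x, y):
        pass  # a real bandit agent would update its reward model here

def lifelong_regret(tasks, policy, n, sigma=0.1):
    """Simulate m tasks of length n and return the lifelong regret R(m, n)."""
    grid = np.linspace(0.0, 1.0, 1001)           # used to approximate max_x f_s(x)
    regret = 0.0
    for f_s in tasks:                            # s = 1, ..., m
        f_star = max(f_s(x) for x in grid)       # f_s(x_s^*), approximated on the grid
        policy.reset()
        for _ in range(n):                       # i = 1, ..., n
            x = policy.select()
            y = f_s(x) + sigma * rng.standard_normal()   # y_{s,i} = f_s(x_{s,i}) + eps_{s,i}
            policy.update(x, y)
            regret += f_star - f_s(x)            # instantaneous regret
    return regret

# Example: m = 3 toy tasks with the same shape but shifted optima.
tasks = [lambda x, c=c: np.exp(-20 * (x - c) ** 2) for c in (0.2, 0.5, 0.8)]
print(lifelong_regret(tasks, RandomPolicy(), n=50))
```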
We take a data-driven approach to select the kernel. In particular, we aim to sequentially meta-learn a kernel $\hat{k}$ which approximates the true kernel $k$, using the data from previous bandit tasks. Let $D_s := \{(x_{s,i}, y_{s,i}) \mid i \le n\}$ be the data corresponding to task $s$, and $D_{1:s} := D_1 \cup \dots \cup D_s$ be the collection of datasets from the first $s$ tasks. Once the agent solves task $s$, we pass $D_{1:s}$ to the meta-agent, who meta-learns a kernel $\hat{k}_s$. This kernel is then provided to the agent, who uses it for solving the next BO task. Our meta-learning algorithm can be paired with any kernelized bandit algorithm and achieves sublinear lifelong regret if the bandit algorithm achieves sublinear single-task regret given oracle knowledge of the kernel.
3 META-LEARNING KERNELS
We first present Meta Kernelized Group Lasso (META-KGL), our approach to estimate the kernel $k$ given data from previous tasks. We consider a large set of eligible known base kernels $\{k_1, \dots, k_p\}$, where $k_j : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for all $j = 1, \dots, p$ and, without loss of generality, $k_j(x, x') \le 1$ for all $x, x' \in \mathcal{X}$. We assume that while $k$ is unknown, it is a sparse linear combination of kernels selected from this set, i.e., there exist $J \subset \{1, \dots, p\}$ and $\alpha_1, \dots, \alpha_p \in \mathbb{R}$ such that
$$k(x, x') = \sum_{j \in J} \alpha_j k_j(x, x'),$$
where $|J| \ll p$. The set $\{k_1, \dots, k_p\}$ can be very large, since we prove that the cost of finding $J$ depends only logarithmically on $p$. We further assume that each $k_j$ corresponds to a $d_j$-dimensional feature map, i.e., $k_j(x, x') = \phi_j(x)^\top \phi_j(x')$, where $\phi_j(x) \in \mathbb{R}^{d_j}$ and $d_j < \infty$. This setting generalizes the common linear bandit assumption to also account for higher-order terms and interactions between coordinates of the input. Let $\phi(x)$ denote the concatenated $d$-dimensional feature map, where $d := \sum_{j=1}^{p} d_j$ and $\phi(x) := [\phi_j^\top(x)]_{j \le p}$. Then for $s = 1, \dots, m$ the reward functions can be written as $f_s(\cdot) = \sum_{j=1}^{p} \phi_j^\top(\cdot)\, \beta^*_s(j)$ such that $\beta^*_s(j) = 0$ for all $j \notin J$. Moreover, the RKHS norm of $f_s$ equals $\|f_s\|^2_k = \sum_{j=1}^{p} \|\beta^*_s(j)\|^2_2$. This kernel model is inspired by Kassraie et al. [2022], who assume $k$ lies in the convex cone of the base kernels.
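As an illustration of this kernel model (not taken from the paper), the sketch below uses one-dimensional cosine feature maps as the base kernels, a hand-picked active set J, and group-sparse coefficients; it only shows how $k$ and the reward functions are assembled from the $\phi_j$.

```python
import numpy as np

p = 20                                    # number of candidate base kernels
J = [2, 7, 11]                            # true active set, |J| << p (illustrative)

def phi(x):
    """Concatenated feature map: each phi_j is a 1-d cosine feature (d_j = 1)."""
    return np.array([np.cos(j * np.pi * x) for j in range(1, p + 1)])

def k_true(x, xp):
    """True kernel: uniform average of the active base kernels k_j(x, x') = phi_j(x) phi_j(x')."""
    return sum(phi(x)[j] * phi(xp)[j] for j in J) / len(J)

def sample_reward(rng):
    """Sample a reward function f_s with group-sparse coefficients beta_s."""
    beta = np.zeros(p)
    beta[J] = rng.standard_normal(len(J))  # beta_s(j) = 0 for all j not in J
    return lambda x: phi(x) @ beta

rng = np.random.default_rng(1)
f = sample_reward(rng)
print(k_true(0.3, 0.7), f(0.3))
```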
In the lifelong setting, we sequentially form kernel estimates $\hat{k}_s$ based on $D_{1:s}$ for $s = 1, \dots, m$. In this section, we consider one snapshot of this process for $s = m$, where we have fixed meta-training data $D_{1:m}$ and meta-learn $\hat{k} := \hat{k}_m$. We assume without loss of generality (cf. Appendix C) that
$$k(x, x') = \frac{1}{|J|} \sum_{j \in J} k_j(x, x'),$$
and minimize a sparsity-inducing loss which allows us to discard kernels that do not appear in the above formulation. META-KGL first minimizes $\mathcal{L}(\beta; D_{1:m})$ over $\beta \in \mathbb{R}^{md}$:
$$\mathcal{L}(\beta; D_{1:m}) := \frac{1}{|D_{1:m}|}\,\|y - \Phi\beta\|_2^2 + \lambda \sum_{j=1}^{p} \|\beta(j)\|_2 \qquad (1)$$
$$= \frac{1}{mn} \sum_{s=1}^{m} \sum_{i=1}^{n} \Big(y_{s,i} - \sum_{j=1}^{p} \phi_j^\top(x_{s,i})\,\beta_s(j)\Big)^2 + \lambda \sum_{j=1}^{p} \sqrt{\sum_{s=1}^{m} \|\beta_s(j)\|_2^2}.$$
The vectorized formulation uses the following notation:
$$y := \big[[y_{1,i}]_{i \le n}, \dots, [y_{m,i}]_{i \le n}\big] \in \mathbb{R}^{mn}, \qquad \beta := \big[[\beta_1(j)]_{j \le p}, \dots, [\beta_m(j)]_{j \le p}\big] \in \mathbb{R}^{md},$$
$$\beta(j) := \big[\beta_1(j), \dots, \beta_m(j)\big] \in \mathbb{R}^{m d_j}, \qquad \Phi := \mathrm{diag}(\Phi_1, \dots, \Phi_m) \in \mathbb{R}^{mn \times dm}.$$
Here $\Phi_s := (\phi(x_{s,1}), \dots, \phi(x_{s,n}))$ is the $n \times d$ feature matrix of task $s$, and therefore $\Phi$ denotes a block-diagonal matrix which gathers the features across all tasks. This meta-loss function is convex and is equivalent to the well-known Group Lasso objective [Lounici et al., 2011]. Therefore, it can be efficiently optimized using Group Lasso solvers [e.g., Massias et al., 2018] and enjoys the statistical properties of the Group Lasso, e.g., consistency and variable selection.
Let $\hat\beta := \arg\min \mathcal{L}(\beta; D_{1:m})$. The first term in Eq. (1) represents the squared prediction error of $\hat\beta$, while the second is a regularization term that induces group sparsity in $\hat\beta$. Mainly, the solutions $\hat\beta = (\hat\beta(1), \dots, \hat\beta(p))$ to this problem are group sparse, i.e., $\hat\beta(j) = 0$ for many of the indices $j \in \{1, \dots, p\}$. META-KGL then constructs the set of plausible kernels $\hat{J}$ by thresholding $\|\hat\beta(j)\|_2$ and discarding the kernels that do not appear to be influencing the data, i.e.,
$$\hat{J} := \big\{\, j \in \{1, \dots, p\} \;\text{ s.t. }\; \|\hat\beta(j)\|_2 > \omega\sqrt{m} \,\big\},$$
where $\omega > 0$ is a hyperparameter of the algorithm. We then construct the estimated kernel as
$$\hat{k}(x, x') := \frac{1}{|\hat{J}|} \sum_{j \in \hat{J}} k_j(x, x').$$
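As a rough illustration of the two META-KGL steps (group-lasso minimization, then thresholding), the sketch below fits the objective in Eq. (1) with a simple proximal-gradient loop and then selects $\hat{J}$. It is a simplified version with $d_j = 1$ and a hand-picked threshold; the solver, step size, and all names are assumptions, not the implementation used in the paper, which can rely on off-the-shelf Group Lasso solvers.

```python
import numpy as np

def block_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2 (group soft-thresholding)."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def meta_kgl(Phi, Y, lam, omega, n_iters=2000, step=None):
    """Sketch of META-KGL with d_j = 1.

    Phi : (m, n, p) per-task feature matrices, Phi[s, i] = phi(x_{s,i})
    Y   : (m, n)    per-task rewards
    Returns the estimated active set J_hat and coefficients beta of shape (m, p).
    """
    m, n, p = Phi.shape
    beta = np.zeros((m, p))
    if step is None:
        # conservative step size: 1 / Lipschitz constant of the smooth part
        L = 2.0 / (m * n) * max(np.linalg.norm(Phi[s], 2) ** 2 for s in range(m))
        step = 1.0 / L
    for _ in range(n_iters):
        # gradient of (1/mn) * sum_s ||y_s - Phi_s beta_s||^2
        grad = np.stack([2.0 / (m * n) * Phi[s].T @ (Phi[s] @ beta[s] - Y[s])
                         for s in range(m)])
        beta = beta - step * grad
        for j in range(p):                       # prox of lam * sum_j ||beta(j)||_2
            beta[:, j] = block_soft_threshold(beta[:, j], step * lam)
    J_hat = [j for j in range(p) if np.linalg.norm(beta[:, j]) > omega * np.sqrt(m)]
    return J_hat, beta

def k_hat(x, xp, J_hat, phi):
    """Estimated kernel: uniform average of the selected base kernels."""
    fx, fxp = phi(x), phi(xp)
    return sum(fx[j] * fxp[j] for j in J_hat) / max(len(J_hat), 1)
```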
META-KGL is summarized in Algorithm 1. Under mild assumptions on the dataset, we can show that $\hat{k}$ converges to the true kernel $k$ in probability. Our first assumption ensures that if $j \in J$, i.e., $k_j$ is active in the true kernel, then the contribution of $k_j$ to the data is large enough to be statistically detectable under noise.

Assumption 3.1 (Beta-min). There exists $c_1 > 0$ such that for all $j \in J$,
$$\|\beta^*(j)\|_2 \ge c_1\sqrt{m}.$$
This assumption is commonly used in the high-dimensional statistics literature [Bühlmann and Van De Geer, 2011, Bunea et al., 2013, Zhao and Yu, 2006]. Our second assumption requires that the meta-training data is sufficiently diverse. In Proposition 4.3, we propose a policy which provably satisfies this assumption.

Assumption 3.2 (Sufficiently Informative Data). The feature matrix $\Phi \in \mathbb{R}^{mn \times d}$ is sufficiently informative if there exists a constant $c_\kappa > 0$ such that $\kappa(\Phi) \ge c_\kappa$, where
$$\kappa(\Phi) := \inf_{(J', b)} \frac{\tfrac{1}{\sqrt{n}}\,\|\Phi b\|_2}{\sum_{j \in J'} \|b(j)\|_2} \quad \text{s.t.}\quad b \in \mathbb{R}^{d}\setminus\{0\},\;\; \sum_{j \notin J'} \|b(j)\|_2 \le 3\sum_{j \in J'} \|b(j)\|_2,\;\; J' \subset \{1, \dots, p\},\; |J'| \le |J|.$$
Intuitively, $\kappa(\Phi)$ measures the quality of the data: data points that are almost identical decrease $\kappa(\Phi)$, and $\kappa(\Phi)$ is large when data points are diverse. If the minimum eigenvalue of $\Phi^\top\Phi$ is positive, Assumption 3.2 is automatically fulfilled. This type of assumption is common in the literature on representation/meta-learning for sequential decision-making [Yang et al., 2021, Cella and Pontil, 2021, Kassraie et al., 2022] and sparse linear bandits [Bastani and Bayati, 2020, Hao et al., 2020, Kim and Paik, 2019]. It is also known in the Lasso literature as the compatibility condition [Bühlmann and Van De Geer, 2011]. Given these assumptions, we show that META-KGL recovers the true kernel with high probability.
Theorem 3.3 (Consistency of META-KGL). Suppose $D_{1:m}$ satisfies Assumptions 3.1 and 3.2 with constants $c_1$ and $c_\kappa$, respectively. Set $\omega \in (0, c_1)$ and define $\bar\omega = \min\{\omega, c_1 - \omega\}$. Choose $\lambda = \bar\omega c_\kappa^2/(8m)$. Then for $n > 32\sigma/(\bar\omega c_\kappa^2)$, META-KGL satisfies
$$\mathbb{P}\big[\hat{J} = J\big] \ge 1 - p\exp\left(-m\left(\frac{\bar\omega c_\kappa^2\, n}{32\sigma} - 1\right)^2\right).$$
In particular, $\hat{J}$ is a consistent estimator in both $n$ and $m$:
$$\lim_{n\to\infty} \mathbb{P}\big[\hat{J} = J\big] = 1, \quad \text{and} \quad \lim_{m\to\infty} \mathbb{P}\big[\hat{J} = J\big] = 1.$$
Appendix D presents the proof of Theorem 3.3. This theorem shows that our meta-learned kernel converges to $k$ as the number of meta-training tasks increases. First, this implies that the meta-learned hypothesis space includes the unknown reward functions, allowing downstream bandit algorithms to provably converge to the optimum. Second, all candidate kernels $k_j$ that are not active in $k$ are eventually excluded from $\hat{k}$. By excluding all $k_j$ with $j \notin J$, which are not necessary for estimating $f_s \in \mathcal{H}_k$, we effectively shrink the size of the hypothesis space, thereby reducing the uncertainty of the reward function estimates during bandit optimization. Compared to $k_{\mathrm{full}} := \frac{1}{p}\sum_{j=1}^{p} k_j$, which naively uses all kernels, this leads to significant improvements in the query efficiency and performance of the bandit optimization.
Comparison with Prior Work. Kassraie et al. [2022] propose META-KEL, a Lasso-equivalent loss for meta-learning a sparse kernel, given i.i.d. offline data from i.i.d. tasks. We emphasize that it is not possible to achieve lifelong guarantees by sequentially applying this algorithm. META-KGL differs from META-KEL in key points and satisfies stronger consistency guarantees: 1) It converges to $k$ as either $n$, the number of samples per task, or $m$, the number of tasks, grows. In contrast, META-KEL converges in $m$ only. 2) META-KGL satisfies an exact recovery guarantee for $k$, since $J = \hat{J}$ with high probability, while META-KEL only guarantees that $J \subseteq \hat{J}$. This is not sufficient to show that meta-learning improves upon the trivial kernel choice $k_{\mathrm{full}}$. Both of these properties are required in the lifelong analysis.
4 LIFELONG BANDIT OPTIMIZATION
We now use META-KGL as a building block to develop the Lifelong Bandit Optimizer (LIBO), an algorithm for lifelong bandit or Bayesian optimization. LIBO is paired with a BASEBO agent, which can be instantiated by any kernelized bandit algorithm, e.g., GP-UCB [Srinivas et al., 2010] or GP-TS [Chowdhury and Gopalan, 2017]. For each task $f_s$, the BASEBO agent is given the kernel $\hat{k}_{s-1}$ meta-learned on the first $s - 1$ tasks. Equipped with this kernel, BASEBO interacts with the current bandit environment, aiming to optimize its payoff by balancing exploration and exploitation.
In the lifelong setting, we not only have to explore for the sake of optimizing the current reward function $f_s$, but we also need to make sure that the sequence of action–reward pairs will be sufficiently informative (in the sense of Assumption 3.2) for meta-learning $\hat{k}_s$ in the next stage. To this end, LIBO forces the base agent to select purely exploratory actions for the first $n_s$ steps of the task, by i.i.d. sampling from the uniform distribution on $\mathcal{X}$. Following Basu et al. [2021], we refer to this as forced exploration and use $D^{\mathrm{exp}}_s := \{(x_{s,i}, y_{s,i}),\, i \le n_s\}$ to refer to the collected exploratory data of task $f_s$. We use a decreasing sequence $(n_1, \dots, n_m)$ as detailed below, since less exploration by BASEBO will be required once more multi-task data is collected. For steps $i > n_s$, BASEBO selects actions according to its normal bandit policy. After the agent has interacted with the current task for $n$ steps, we pass the exploratory data $D^{\mathrm{exp}}_{1:s}$ to META-KGL to meta-learn $\hat{k}_s$. We then announce this new kernel estimate to the BASEBO agent for solving the next task $s + 1$. Figure 1 visualizes this process and Algorithm 2 summarizes LIBO.

[Figure 1: Overview of LIBO. BASEBO ($i > n_s$) and forced exploration ($i \le n_s$) interact with the environment through actions $x_{s,i}$ and rewards $f_s(x_{s,i}) + \epsilon_{s,i}$; the exploratory datasets $D^{\mathrm{exp}}_1, \dots, D^{\mathrm{exp}}_{s-1}$ are passed to META-KGL, which returns the kernel estimate $\hat{k}_{s-1}$ to BASEBO.]
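Putting the pieces together, a high-level sketch of the LIBO loop could look as follows. The BASEBO interface, the one-dimensional uniform domain, and the call signatures are assumptions made for illustration; Algorithm 2 in the paper is the authoritative procedure.

```python
import numpy as np

def libo(tasks, base_bo_factory, meta_learn_kernel, n, k_init, rng):
    """Sketch of the LIBO outer loop (cf. Algorithm 2, illustrative only).

    tasks             : sequence of reward functions f_1, ..., f_m
    base_bo_factory   : kernel -> BASEBO agent with .select() and .update(x, y)
    meta_learn_kernel : list of exploration datasets -> kernel estimate
    """
    k_hat = k_init                      # e.g., start from k_full
    exp_data = []                       # D_1^exp, ..., D_{s-1}^exp
    for s, f_s in enumerate(tasks, start=1):
        n_s = int(np.ceil(np.sqrt(n) / s ** 0.25))   # decreasing forced-exploration budget
        agent = base_bo_factory(k_hat)               # BASEBO equipped with k_hat_{s-1}
        D_exp = []
        for i in range(1, n + 1):
            if i <= n_s:
                x = rng.uniform(0.0, 1.0)            # forced exploration: uniform on X
            else:
                x = agent.select()                   # normal bandit policy
            y = f_s(x) + 0.1 * rng.standard_normal()
            agent.update(x, y)
            if i <= n_s:
                D_exp.append((x, y))
        exp_data.append(D_exp)
        k_hat = meta_learn_kernel(exp_data)          # meta-learn k_hat_s from D_{1:s}^exp
    return k_hat
```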
4.1 REGRET BOUNDS
Let $R^*(n)$ be the worst-case regret of BASEBO with oracle knowledge of the true kernel $k$ on single tasks when the reward resides in $\mathcal{H}_k$. When employed sequentially on $m$ bandit tasks, the worst-case lifelong regret $R^*(m, n)$ will be of the order $mR^*(n)$ with high probability. We refer to this as the oracle regret, since here BASEBO has access to the true kernel $k$, which does not hold in practice. Since our meta-learned kernels $\hat{k}_s$ are approximations of $k$, the oracle regret is a natural lower bound on the regret of LIBO. In the following, we show that if $R^*(n)$, the single-task oracle regret of the base bandit algorithm, is sublinear (e.g., as for GP-UCB or GP-TS), then so is the lifelong regret $R(m, n)$ of LIBO. Importantly, $R(m, n)$ is not only sublinear in $n$, but also converges with high probability to $R^*(m, n)$. Theorem 4.1 presents this guarantee, assuming that the forced exploration datasets $D^{\mathrm{exp}}_s$ satisfy Assumption 3.2, which META-KGL requires to yield a provably consistent estimator of $k$. Later, in Proposition 4.3, we show that exploration by i.i.d. sampling from a uniform distribution over $\mathcal{X}$ guarantees this assumption.
Theorem 4.1. For all tasks $s = 1, \dots, m$, assume that the reward function $f_s \in \mathcal{H}_k$ has bounded RKHS norm $\|f_s\|_k \le B$. Set the number of forced exploration actions as $n_s = \sqrt{n}/s^{1/4}$, and assume that Assumptions 3.1 and 3.2 hold for the data $D^{\mathrm{exp}}_{1:s}$ for all $s = 1, \dots, m$. Suppose that, with probability greater than $1 - \delta/2$, BASEBO has worst-case oracle regret $R^*(m, n)$. Then, the lifelong regret of LIBO satisfies
$$R(m, n) - R^*(m, n) = \mathcal{O}\Big(\underbrace{B m^{3/4}\sqrt{n}}_{\text{forced exp.}} \;+\; \underbrace{B (nm)^{1/3} \log^{3/4}(mp/\delta)}_{\text{kernel mismatch}}\Big)$$
with probability greater than $1 - \delta$.
The explicit inequality without the $\mathcal{O}$-notation can be found in Appendix E, together with the proof. In the following, we give a sketch of the proof, aiming to explain the source of each term in the bound. For every forced exploration step, in the worst case we suffer regret of $2B$. When accumulated over a total of $\sum_{s=1}^{m} n_s$ such steps, this gives the first term in the bound. If $\hat{k}_s \neq k$, it is possible to suffer linear regret in the worst case. To account for this, we calculate the smallest integer $m_0$ for which, with high probability, $\hat{k}_s = k$ for all $m_0 < s \le m$. Based on Theorem 3.3, we show that $m_0 = \mathcal{O}\big((m/n^2)^{1/3}\log^{3/4}(mp/\delta)\big)$. For every task $s \le m_0$ we may suffer linear regret of $2Bn$, i.e., at most $2Bnm_0$ in total in the worst case. This is upper bounded by the second term in Theorem 4.1, which can be regarded as the cost of learning $J$. Notably, it grows only logarithmically with $p$, the number of considered features/kernels, offering a significant improvement over the polynomial rates given by prior works [e.g., Yang et al., 2021, Hong et al., 2022]. Table 2 and Table 3 present a comprehensive list of the related regret bounds.
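For completeness, the accumulation of the forced-exploration steps into the first term of Theorem 4.1 can be spelled out with a standard integral comparison (a supplementary step, assuming $n_s = \sqrt{n}/s^{1/4}$ as reconstructed above):
$$\sum_{s=1}^{m} n_s \;=\; \sqrt{n}\sum_{s=1}^{m} s^{-1/4} \;\le\; \sqrt{n}\Big(1 + \int_{1}^{m} t^{-1/4}\,\mathrm{d}t\Big) \;\le\; \tfrac{4}{3}\,\sqrt{n}\, m^{3/4},$$
so the forced-exploration steps contribute at most $2B \cdot \tfrac{4}{3}\sqrt{n}\,m^{3/4} = \mathcal{O}(B m^{3/4}\sqrt{n})$ regret.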
We highlight that the excess regret of LIBO in Theorem 4.1 is sublinear in both $m$ and $n$. This implies that the algorithm is oracle optimal, meaning that as $m \to \infty$, the single-task regret without knowledge of $k$ eventually approaches the oracle single-task regret. Recall that $R^*(m, n) = mR^*(n)$ and therefore $R(m, n)/m \to R^*(n)$. This guarantee is stronger than that of Basu et al. [2021] and Peleg et al. [2022], where the excess regret depends linearly on $m$ due to excessive forced exploration. By decreasing $n_s \propto s^{-1/4}$, the number of exploratory steps vanishes throughout the sequence of tasks.
As an example, we analyze the performance when GP-UCB¹ [Srinivas et al., 2010] is used as the BASEBO algorithm. In this case, we demonstrate that the worst-case lifelong regret of LIBO is of the same rate as the corresponding oracle regret. To highlight the benefit of this oracle optimality, we compare to a naive baseline which uses $\hat{k}_s = k_{\mathrm{full}} = \sum_{j=1}^{p} \frac{1}{p} k_j$ for all tasks instead of meta-learning $\hat{k}_s$ sequentially. In particular, we consider solving a sequence of $m$ tasks in three scenarios: 1) running LIBO paired with GP-UCB, 2) repeatedly running GP-UCB with oracle access to $k$, and 3) repeatedly running GP-UCB with $k_{\mathrm{full}}$. The following corollary shows that the worst-case upper bounds for the first two scenarios match in $\mathcal{O}$-notation.
Corollary 4.2 (Lifelong GP-UCB). Consider the setting of Theorem 4.1 with GP-UCB as the BASEBO agent. Then, with probability at least $1 - \delta$, the lifelong regret of LIBO paired with GP-UCB satisfies
$$R(m, n) = \mathcal{O}\big(R^*(m, n)\big) = \mathcal{O}\Big(Bm\sqrt{d^*\, n \log\tfrac{n}{d^*}} + m\sqrt{n d^* \log\tfrac{n}{d^*}\,\log\tfrac{1}{\delta}}\Big)$$
where $d^* := \sum_{j \in J} d_j$.
¹Appendix E.1 provides a background on GP-UCB.
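To make the role of the base agent concrete, here is a generic GP-UCB acquisition step that could serve as BASEBO with the meta-learned kernel $\hat{k}_s$. The candidate grid, the fixed exploration coefficient beta, and the noise level are illustrative assumptions; they are not the constants used in the paper's analysis.

```python
import numpy as np

def gp_ucb_step(k_hat, X_obs, y_obs, candidates, beta=2.0, noise=0.01):
    """One GP-UCB step: pick the argmax of posterior mean + sqrt(beta) * posterior std.

    k_hat      : kernel function k_hat(x, x')
    X_obs      : list of previously selected actions
    y_obs      : array of corresponding rewards
    candidates : array of candidate actions to score
    """
    if len(X_obs) == 0:
        return candidates[np.random.randint(len(candidates))]
    K = np.array([[k_hat(a, b) for b in X_obs] for a in X_obs])
    K_inv = np.linalg.inv(K + noise * np.eye(len(X_obs)))
    best, best_ucb = None, -np.inf
    for x in candidates:
        k_x = np.array([k_hat(x, a) for a in X_obs])
        mean = k_x @ K_inv @ np.asarray(y_obs)           # posterior mean
        var = max(k_hat(x, x) - k_x @ K_inv @ k_x, 0.0)  # posterior variance
        ucb = mean + np.sqrt(beta) * np.sqrt(var)
        if ucb > best_ucb:
            best, best_ucb = x, ucb
    return best
```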
In the third scenario, we conservatively set $\hat{k}_s = k_{\mathrm{full}}$ for all $s = 1, \dots, m$. While this is sufficient for attaining a lifelong regret that is sublinear in $n$, the performance will not be oracle optimal. In particular, this algorithm suffers from a regret of
$$R(m, n) = \mathcal{O}\Big(Bm\sqrt{\tfrac{dp}{|J|}\, n \log\tfrac{n}{d}} + \sqrt{n d \log\tfrac{n}{d}\,\log\tfrac{1}{\delta}}\Big)$$
where $d = \sum_{j=1}^{p} d_j \ge d^*$ and $p/|J|$ can be very large. Our experiments confirm that the performance of the naive approach is significantly worse than that of the other variants. This is because confidence bounds constructed using $k_{\mathrm{full}}$ tend to contract more slowly than those constructed with the sparse meta-learned $\hat{k}_s$.
4.2 FORCED EXPLORATION
Our forced exploration scheme ensures that the collected data is sufficiently informative to guarantee successful meta-learning. From a technical perspective, it ensures that Assumption 3.2 is met and allows for a consistent estimator of $k$. The cost of this exploration in the regret of each task is smaller in rate than the minimax regret bound [Lattimore and Szepesvári, 2020]. Therefore it has only a negligible effect on the overall performance guarantees of LIBO (see Corollary 4.2). We show that by uniformly drawing actions from the domain, the collected data satisfies this assumption:
Proposition 4.3. Assume that $\phi_j \in L^2(\mathcal{X})$, $j \in \{1, \dots, p\}$, are orthonormal and let $d_j = 1$. Draw $x_1, \dots, x_{n_1}$ independently and uniformly from $\mathcal{X}$, and repeatedly use them to construct $D^{\mathrm{exp}}_{1:s}$. Then with probability at least $1 - \delta$, $D^{\mathrm{exp}}_{1:s}$ satisfies Assumption 3.2, for $s = 1, \dots, m$.
The proof can be found in Appendix E.3. The $d_j = 1$ condition is met without loss of generality by splitting the higher-dimensional feature maps and introducing more base features, which will increase $p$. Moreover, the orthonormality condition is met by orthogonalizing and re-scaling the feature maps. Basis functions such as Legendre polynomials and Fourier features [Rahimi et al., 2007] satisfy these conditions.
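The informativeness of uniformly drawn exploration data can also be checked numerically. The sketch below draws uniform actions, builds the feature matrix for an orthonormal cosine basis ($d_j = 1$), and reports the smallest eigenvalue of the empirical covariance $\frac{1}{n}\Phi^\top\Phi$; this being bounded away from zero is a sufficient condition for Assumption 3.2. The feature choice, the domain, and the eigenvalue check are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_exp = 10, 200

# Forced exploration: actions drawn i.i.d. and uniformly from X = [0, 1].
x_exp = rng.uniform(0.0, 1.0, size=n_exp)

# Orthonormal 1-d feature maps (d_j = 1): sqrt(2) * cos(j * pi * x) is an
# orthonormal family in L^2([0, 1]) for j = 1, ..., p.
Phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(x_exp, np.arange(1, p + 1)))

cov = Phi.T @ Phi / n_exp            # empirical feature covariance (p x p)
print("smallest eigenvalue:", np.linalg.eigvalsh(cov).min())
```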
Generally, it is natural to require BASEBO to explore more in the lifelong setting than when it is used in isolation with a known kernel. We observe in our experiments that LIBO has better empirical performance with forced exploration (i.e., $n_s > 0$) than without. This additional exploration is also required in the representation learning [Yang et al., 2021, 2022, Cella and Pontil, 2021, Cella et al., 2022] and hierarchical Bayesian bandit literature [Basu et al., 2021, Peleg et al., 2022, Hong et al., 2022], where it is assumed that either the context distribution or the chosen actions are diverse enough. In the case of contextual bandits, if there is sufficient randomness, the BASEBO can