Lifelong Bandit Optimization: No Prior and No Regret
Felix Schur*¹, Parnian Kassraie*¹, Jonas Rothfuss¹, Andreas Krause¹
¹ETH Zurich, Switzerland
*Equal contribution.
Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023).
arXiv:2210.15513v3 [stat.ML] 20 Jun 2023
Abstract
Machine learning algorithms are often applied to problems with similar structure, over and over again. We focus on solving a sequence
of bandit optimization tasks and develop LIBO,
an algorithm which adapts to the environment
by learning from past experience and becomes
more sample-efficient in the process. We assume
a kernelized structure where the kernel is unknown
but shared across all tasks. LIBO sequentially
meta-learns a kernel that approximates the true
kernel and solves the incoming tasks with the
latest kernel estimate. Our algorithm can be paired
with any kernelized or linear bandit algorithm and
guarantees oracle optimal performance, meaning
that as more tasks are solved, the regret of LIBO
on each task converges to the regret of the bandit
algorithm with oracle knowledge of the true
kernel. Naturally, if paired with a sublinear bandit
algorithm, LIBO yields a sublinear lifelong regret.
We also show that direct access to the data from
each task is not necessary for attaining sublinear
regret. We propose F-LIBO, which solves the
lifelong problem in a federated manner.
1 INTRODUCTION
A key aspect of human intelligence is our ability to harness
previous experience and quickly improve when repeatedly
solving similar problems. In this paper, we study how to
solve a sequence of learning problems, on related instances,
and become more efficient in the process. In particular we
focus on problems which are solved through Bayesian Opti-
mization, a.k.a. kernelized bandit algorithms (BO), where
the kernel captures the regularity structure of the tasks. A motivating application is AutoML systems, which perform hyper-parameter tuning for the same model on different datasets, or for different models on the same dataset. We expect
that the more tasks our machine learning system solves, the
better the system becomes at solving the next one.
We model this as lifelong learning, where an agent sequentially faces kernelized bandit problems with different unknown reward functions. While prior work assumes the kernel to be known (e.g., hand-designed), we consider the kernel $k$ to be unknown, but shared between the problem instances. After each bandit task, we use the previously collected data to meta-learn a kernel function $\hat{k}$ as a proxy for the unknown $k$. We transfer knowledge across tasks by sequentially updating the meta-learned kernel and using it to solve the next task. This way, we adapt to the environment and gradually improve the bandit performance. Ideally, we would like to reach the oracle-optimal performance, i.e., the performance of a bandit algorithm with complete knowledge of the environment.
Lifelong bandit optimization is a delicate problem for two reasons. First, the success of each round of BO depends on the validity of the meta-learned kernel: we only have guaranteed convergence and sublinear regret if the reproducing kernel Hilbert space (RKHS) induced by the estimated kernel $\hat{k}$ contains the reward functions. Second, the data that is used for meta-learning is collected during the previous BO tasks. Thus, during each BO round, we not only have to quickly find reward-maximizing actions, but also have to collect exploratory data that is sufficiently informative for successful meta-learning of the kernel.
We address these challenges when the true kernel is a sparse
convex combination of a large number of candidate kernels.
We propose an approach for meta-learning a provably
consistent estimator of the true kernel, given data from
previous tasks (Theorem 3.3). To ensure that this data is
sufficiently informative, we interlace the queries of the
BO agent with purely exploratory queries. Combining
these two key ideas, we design our main algorithm, the
Lifelong Bandit Optimizer (LIBO). This algorithm is
versatile since it is agnostic to the bandit policy, i.e. it
can be wrapped around any kernelized or linear base
bandit algorithm to influence its policy and satisfy lifelong
guarantees. We prove that it is oracle-optimal, i.e. that by
using LIBO, we can eventually achieve the same worst-case
performance as the base bandit algorithm which has oracle
knowledge of the true kernel (Theorem 4.1). We do not
make assumptions about the base bandit algorithm, and our
convergence guarantees hold for many bandit solvers such
as OFUL [Abbasi-Yadkori et al., 2011], GP-UCB [Srinivas et al., 2010], or GP-TS [Chowdhury and Gopalan, 2017].
Additionally, we consider a federated setting where each
BO task is performed by a client node in a network and the
data ought not to be exchanged with the server node due to
privacy concerns. We propose the Federated Lifelong Bandit
Optimizer (F-LIBO), and show that it satisfies a guarantee
similar to LIBO (Theorem 5.1). If we take GP-UCB as
base bandit solver, LIBO and F-LIBO have the same
worst-case regret bound rates as the GP-UCB solver when
given oracle knowledge of the true kernel (Corollary 4.2
and 5.2). In Section 6 we support our theoretical findings by
experiments on synthetic and real-world data in the AutoML
context. Lastly, we discuss related works in Section 7.
2 PROBLEM STATEMENT
We consider a lifelong optimization setting, where an agent
interacts with a sequence of black-box optimization prob-
lems, arriving one after another. Throughout the sequence of
optimization tasks, the agent can adapt to the environment
based on the previously collected data and improve its
performance on the succeeding tasks. Formally, the agent
iteratively faces bandit problems with unknown reward functions $f_1, \dots, f_m$ residing in an RKHS $\mathcal{H}_k$ that corresponds to an unknown kernel function $k$. To impose regularity, we assume that the reward functions have bounded kernel norm $\|f_s\|_k \le B$ and that the domain $\mathcal{X} \subset \mathbb{R}^{d_0}$ is compact. The agent interacts with each task $f_s$ for $n$ time steps. For each task $s = 1, \dots, m$, at time step $i = 1, \dots, n$, the agent selects an action $x_{s,i} \in \mathcal{X}$ and receives a stochastic reward via $y_{s,i} = f_s(x_{s,i}) + \varepsilon_{s,i}$. Here, the $\varepsilon_{s,i}$ are i.i.d. samples from a zero-mean sub-Gaussian noise distribution with variance proxy $\sigma^2$.
The goal of the agent is to maximize its rewards across all tasks. This can be formalized as minimizing the lifelong regret over $m$ tasks of size $n$, defined as
$$R(m, n) := \sum_{s=1}^{m} \sum_{i=1}^{n} f_s(x^*_s) - f_s(x_{s,i}),$$
where $x^*_s$ is a global maximum of $f_s$. If $R(m, n)/(mn) \to 0$ as $m, n \to \infty$, then the agent eventually converges to the global optimum of each upcoming optimization task. This property is commonly referred to as sublinearity of the regret. To attain a small regret, the agent maintains an estimate of the unknown reward function $f_s$ based on its history. Typically, a kernelized regression oracle (e.g., kernel ridge regression or Gaussian processes) is employed for this task. The choice of the kernel function plays a key role in the success and data-efficiency of the bandit optimization. If the hypothesis space $\mathcal{H}_k$ induced by the kernel $k$ is too restrictive and does not contain the true reward functions $f_s$, the agent will likely never find reward-maximizing actions. To prevent this, most practitioners pick a conservatively complex kernel with a large hypothesis space that is very likely to contain $\mathcal{H}_k$. However, the larger $\mathcal{H}_k$, the more observations it takes to form a good reward estimate, making it less efficient to find an optimal solution.
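To make the interaction protocol and the lifelong regret concrete, here is a minimal Python sketch of the loop described above. The one-dimensional domain, the noise level, the placeholder random policy, and all names are illustrative assumptions rather than anything from the paper; a real bandit agent would replace RandomPolicy.

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomPolicy:
    """Placeholder agent: picks uniformly random actions in [0, 1]."""
    def reset(self):
        pass
    def select(self):
        return rng.uniform(0.0, 1.0)
    def update(self, x, y):
        pass  # a real bandit agent would update its reward model here

def lifelong_regret(tasks, policy, n, sigma=0.1):
    """Simulate m tasks of length n and return the lifelong regret R(m, n)."""
    grid = np.linspace(0.0, 1.0, 1001)           # used to approximate max_x f_s(x)
    regret = 0.0
    for f_s in tasks:                            # s = 1, ..., m
        f_star = max(f_s(x) for x in grid)       # f_s(x_s^*), approximated on the grid
        policy.reset()
        for _ in range(n):                       # i = 1, ..., n
            x = policy.select()
            y = f_s(x) + sigma * rng.standard_normal()   # y_{s,i} = f_s(x_{s,i}) + eps_{s,i}
            policy.update(x, y)
            regret += f_star - f_s(x)            # instantaneous regret
    return regret

# Example: m = 3 toy tasks with the same shape but shifted optima.
tasks = [lambda x, c=c: np.exp(-20 * (x - c) ** 2) for c in (0.2, 0.5, 0.8)]
print(lifelong_regret(tasks, RandomPolicy(), n=50))
```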
We take a data-driven approach to select the kernel. In particular, we aim to sequentially meta-learn a kernel $\hat{k}$ which approximates the true kernel $k$, using the data from previous bandit tasks. Let $D_s := \{(x_{s,i}, y_{s,i}) \mid i \le n\}$ be the data corresponding to task $s$, and $D_{1:s} := D_1 \cup \dots \cup D_s$ be the collection of datasets from the first $s$ tasks. Once the agent solves task $s$, we pass $D_{1:s}$ to the meta-agent, who meta-learns a kernel $\hat{k}_s$. This kernel is then provided to the agent, who uses it for solving the next BO task. Our meta-learning algorithm can be paired with any kernelized bandit algorithm and achieves sublinear lifelong regret if the bandit algorithm achieves sublinear single-task regret given oracle knowledge of the kernel.
3 META-LEARNING KERNELS
We first present Meta Kernelized Group Lasso (META-KGL), our approach to estimate the kernel $k$ given data from previous tasks. We consider a large set of eligible known base kernels $\{k_1, \dots, k_p\}$, where $k_j : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for all $j = 1, \dots, p$ and, without loss of generality, $k_j(x, x') \le 1$ for all $x, x' \in \mathcal{X}$. We assume that while $k$ is unknown, it is a sparse linear combination of kernels selected from this set, i.e., there exist $J \subset \{1, \dots, p\}$ and $\alpha_1, \dots, \alpha_p \in \mathbb{R}$ such that
$$k(x, x') = \sum_{j \in J} \alpha_j k_j(x, x'),$$
where $|J| \ll p$. The set $\{k_1, \dots, k_p\}$ can be very large, since we prove that the cost of finding $J$ depends only logarithmically on $p$. We further assume that each $k_j$ corresponds to a $d_j$-dimensional feature map, i.e., $k_j(x, x') = \phi_j(x)^\top \phi_j(x')$, where $\phi_j(x) \in \mathbb{R}^{d_j}$ and $d_j < \infty$. This setting generalizes the common linear bandit assumption to also account for higher-order terms and interactions between coordinates of the input. Let $\phi(x)$ denote the concatenated $d$-dimensional feature map, where $d := \sum_{j=1}^{p} d_j$ and $\phi(x) := [\phi_j^\top(x)]_{j \le p}$. Then for $s = 1, \dots, m$ the reward functions can be written as $f_s(\cdot) = \sum_{j=1}^{p} \phi_j^\top(\cdot)\, \beta^*_s(j)$ such that $\beta^*_s(j) = 0$ for all $j \notin J$. Moreover, the RKHS norm of $f_s$ equals $\|f_s\|^2_k = \sum_{j=1}^{p} \|\beta^*_s(j)\|^2_2$. This kernel model is inspired by Kassraie et al. [2022], who assume $k$ lies in the convex cone of the base kernels.
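As an illustration of this kernel model (not taken from the paper), the sketch below uses one-dimensional cosine feature maps as the base kernels, a hand-picked active set J, and group-sparse coefficients; it only shows how $k$ and the reward functions are assembled from the $\phi_j$.

```python
import numpy as np

p = 20                                    # number of candidate base kernels
J = [2, 7, 11]                            # true active set, |J| << p (illustrative)

def phi(x):
    """Concatenated feature map: each phi_j is a 1-d cosine feature (d_j = 1)."""
    return np.array([np.cos(j * np.pi * x) for j in range(1, p + 1)])

def k_true(x, xp):
    """True kernel: uniform average of the active base kernels k_j(x, x') = phi_j(x) phi_j(x')."""
    return sum(phi(x)[j] * phi(xp)[j] for j in J) / len(J)

def sample_reward(rng):
    """Sample a reward function f_s with group-sparse coefficients beta_s."""
    beta = np.zeros(p)
    beta[J] = rng.standard_normal(len(J))  # beta_s(j) = 0 for all j not in J
    return lambda x: phi(x) @ beta

rng = np.random.default_rng(1)
f = sample_reward(rng)
print(k_true(0.3, 0.7), f(0.3))
```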
In the lifelong setting, we sequentially form kernel estimates $\hat{k}_s$ based on $D_{1:s}$ for $s = 1, \dots, m$. In this section, we consider one snapshot of this process for $s = m$, where we have fixed meta-training data $D_{1:m}$ and meta-learn $\hat{k} := \hat{k}_m$. We assume without loss of generality (cf. Appendix C) that
$$k(x, x') = \frac{1}{|J|} \sum_{j \in J} k_j(x, x'),$$
and minimize a sparsity-inducing loss which allows us to discard kernels that do not appear in the above formulation. META-KGL first minimizes $\mathcal{L}(\beta; D_{1:m})$ over $\beta \in \mathbb{R}^{md}$:
$$\mathcal{L}(\beta; D_{1:m}) := \frac{1}{|D_{1:m}|}\,\|y - \Phi\beta\|_2^2 + \lambda \sum_{j=1}^{p} \|\beta(j)\|_2 \qquad (1)$$
$$= \frac{1}{mn} \sum_{s=1}^{m} \sum_{i=1}^{n} \Big(y_{s,i} - \sum_{j=1}^{p} \phi_j^\top(x_{s,i})\,\beta_s(j)\Big)^2 + \lambda \sum_{j=1}^{p} \sqrt{\sum_{s=1}^{m} \|\beta_s(j)\|_2^2}.$$
The vectorized formulation uses the following notation:
$$y := \big[[y_{1,i}]_{i \le n}, \dots, [y_{m,i}]_{i \le n}\big] \in \mathbb{R}^{mn}, \qquad \beta := \big[[\beta_1(j)]_{j \le p}, \dots, [\beta_m(j)]_{j \le p}\big] \in \mathbb{R}^{md},$$
$$\beta(j) := \big[\beta_1(j), \dots, \beta_m(j)\big] \in \mathbb{R}^{m d_j}, \qquad \Phi := \mathrm{diag}(\Phi_1, \dots, \Phi_m) \in \mathbb{R}^{mn \times dm}.$$
Here $\Phi_s := (\phi(x_{s,1}), \dots, \phi(x_{s,n}))$ is the $n \times d$ feature matrix of task $s$, and therefore $\Phi$ denotes a block-diagonal matrix which gathers the features across all tasks. This meta-loss function is convex and is equivalent to the well-known Group Lasso objective [Lounici et al., 2011]. Therefore, it can be efficiently optimized using Group Lasso solvers [e.g., Massias et al., 2018] and enjoys the statistical properties of the Group Lasso, e.g., consistency and variable selection.
Let $\hat\beta := \arg\min \mathcal{L}(\beta; D_{1:m})$. The first term in Eq. (1) represents the squared prediction error of $\hat\beta$, while the second is a regularization term that induces group sparsity in $\hat\beta$. Mainly, the solutions $\hat\beta = (\hat\beta(1), \dots, \hat\beta(p))$ to this problem are group sparse, i.e., $\hat\beta(j) = 0$ for many of the indices $j \in \{1, \dots, p\}$. META-KGL then constructs the set of plausible kernels $\hat{J}$ by thresholding $\|\hat\beta(j)\|_2$ and discarding the kernels that do not appear to be influencing the data, i.e.,
$$\hat{J} := \big\{\, j \in \{1, \dots, p\} \;\text{ s.t. }\; \|\hat\beta(j)\|_2 > \omega\sqrt{m} \,\big\},$$
where $\omega > 0$ is a hyperparameter of the algorithm. We then construct the estimated kernel as
$$\hat{k}(x, x') := \frac{1}{|\hat{J}|} \sum_{j \in \hat{J}} k_j(x, x').$$
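As a rough illustration of the two META-KGL steps (group-lasso minimization, then thresholding), the sketch below fits the objective in Eq. (1) with a simple proximal-gradient loop and then selects $\hat{J}$. It is a simplified version with $d_j = 1$ and a hand-picked threshold; the solver, step size, and all names are assumptions, not the implementation used in the paper, which can rely on off-the-shelf Group Lasso solvers.

```python
import numpy as np

def block_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2 (group soft-thresholding)."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def meta_kgl(Phi, Y, lam, omega, n_iters=2000, step=None):
    """Sketch of META-KGL with d_j = 1.

    Phi : (m, n, p) per-task feature matrices, Phi[s, i] = phi(x_{s,i})
    Y   : (m, n)    per-task rewards
    Returns the estimated active set J_hat and coefficients beta of shape (m, p).
    """
    m, n, p = Phi.shape
    beta = np.zeros((m, p))
    if step is None:
        # conservative step size: 1 / Lipschitz constant of the smooth part
        L = 2.0 / (m * n) * max(np.linalg.norm(Phi[s], 2) ** 2 for s in range(m))
        step = 1.0 / L
    for _ in range(n_iters):
        # gradient of (1/mn) * sum_s ||y_s - Phi_s beta_s||^2
        grad = np.stack([2.0 / (m * n) * Phi[s].T @ (Phi[s] @ beta[s] - Y[s])
                         for s in range(m)])
        beta = beta - step * grad
        for j in range(p):                       # prox of lam * sum_j ||beta(j)||_2
            beta[:, j] = block_soft_threshold(beta[:, j], step * lam)
    J_hat = [j for j in range(p) if np.linalg.norm(beta[:, j]) > omega * np.sqrt(m)]
    return J_hat, beta

def k_hat(x, xp, J_hat, phi):
    """Estimated kernel: uniform average of the selected base kernels."""
    fx, fxp = phi(x), phi(xp)
    return sum(fx[j] * fxp[j] for j in J_hat) / max(len(J_hat), 1)
```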
META-KGL is summarized in Algorithm 1. Under mild assumptions on the dataset, we can show that $\hat{k}$ converges to the true kernel $k$ in probability. Our first assumption ensures that if $j \in J$, i.e., $k_j$ is active in the true kernel, then the contribution of $k_j$ to the data is large enough to be statistically detectable under noise.

Assumption 3.1 (Beta-min). There exists $c_1 > 0$ such that for all $j \in J$,
$$\|\beta^*(j)\|_2 \ge c_1\sqrt{m}.$$
This assumption is commonly used in the high-dimensional statistics literature [Bühlmann and Van De Geer, 2011, Bunea et al., 2013, Zhao and Yu, 2006]. Our second assumption requires that the meta-training data is sufficiently diverse. In Proposition 4.3, we propose a policy which provably satisfies this assumption.

Assumption 3.2 (Sufficiently Informative Data). The feature matrix $\Phi \in \mathbb{R}^{mn \times d}$ is sufficiently informative if there exists a constant $c_\kappa > 0$ such that $\kappa(\Phi) \ge c_\kappa$, where
$$\kappa(\Phi) := \inf_{(J', b)} \frac{\tfrac{1}{\sqrt{n}}\,\|\Phi b\|_2}{\sum_{j \in J'} \|b(j)\|_2} \quad \text{s.t.}\quad b \in \mathbb{R}^{d}\setminus\{0\},\;\; \sum_{j \notin J'} \|b(j)\|_2 \le 3\sum_{j \in J'} \|b(j)\|_2,\;\; J' \subset \{1, \dots, p\},\; |J'| \le |J|.$$
Intuitively, $\kappa(\Phi)$ measures the quality of the data: data points that are almost identical decrease $\kappa(\Phi)$, and $\kappa(\Phi)$ is large when data points are diverse. If the minimum eigenvalue of $\Phi^\top\Phi$ is positive, Assumption 3.2 is automatically fulfilled. This type of assumption is common in the literature on representation/meta-learning for sequential decision-making [Yang et al., 2021, Cella and Pontil, 2021, Kassraie et al., 2022] and sparse linear bandits [Bastani and Bayati, 2020, Hao et al., 2020, Kim and Paik, 2019]. It is also known in the Lasso literature as the compatibility condition [Bühlmann and Van De Geer, 2011]. Given these assumptions, we show that META-KGL recovers the true kernel with high probability.
Theorem 3.3 (Consistency of META-KGL). Suppose $D_{1:m}$ satisfies Assumptions 3.1 and 3.2 with constants $c_1$ and $c_\kappa$, respectively. Set $\omega \in (0, c_1)$ and define $\bar\omega = \min\{\omega, c_1 - \omega\}$. Choose $\lambda = \bar\omega c_\kappa^2/(8m)$. Then for $n > 32\sigma/(\bar\omega c_\kappa^2)$, META-KGL satisfies
$$\mathbb{P}\big[\hat{J} = J\big] \ge 1 - p\exp\left(-m\left(\frac{\bar\omega c_\kappa^2\, n}{32\sigma} - 1\right)^2\right).$$
In particular, $\hat{J}$ is a consistent estimator in both $n$ and $m$:
$$\lim_{n\to\infty} \mathbb{P}\big[\hat{J} = J\big] = 1, \quad \text{and} \quad \lim_{m\to\infty} \mathbb{P}\big[\hat{J} = J\big] = 1.$$
Appendix D presents the proof of Theorem 3.3. This theorem shows that our meta-learned kernel converges to $k$ as the number of meta-training tasks increases. First, this implies that the meta-learned hypothesis space includes the unknown reward functions, allowing downstream bandit algorithms to provably converge to the optimum. Second, all candidate kernels $k_j$ that are not active in $k$ are eventually excluded from $\hat{k}$. By excluding all $k_j$ with $j \notin J$, which are not necessary for estimating $f_s \in \mathcal{H}_k$, we effectively shrink the size of the hypothesis space, thereby reducing the uncertainty of the reward function estimates during bandit optimization. Compared to $k_{\mathrm{full}} := \frac{1}{p}\sum_{j=1}^{p} k_j$, which naively uses all kernels, this leads to significant improvements in the query efficiency and performance of the bandit optimization.
Comparison with Prior Work. Kassraie et al. [2022] propose META-KEL, a Lasso-equivalent loss for meta-learning a sparse kernel, given i.i.d. offline data from i.i.d. tasks. We emphasize that it is not possible to achieve lifelong guarantees by sequentially applying this algorithm. META-KGL differs from META-KEL in key points and satisfies stronger consistency guarantees: 1) It converges to $k$ as either $n$, the number of samples per task, or $m$, the number of tasks, grows. In contrast, META-KEL converges in $m$ only. 2) META-KGL satisfies an exact recovery guarantee for $k$, since $J = \hat{J}$ with high probability, while META-KEL only guarantees that $J \subseteq \hat{J}$. This is not sufficient to show that meta-learning improves upon the trivial kernel choice $k_{\mathrm{full}}$. Both of these properties are required in the lifelong analysis.
4 LIFELONG BANDIT OPTIMIZATION
We now use META-KGL as a building block to develop the Lifelong Bandit Optimizer (LIBO), an algorithm for lifelong bandit or Bayesian optimization. LIBO is paired with a BASEBO agent, which can be instantiated by any kernelized bandit algorithm, e.g., GP-UCB [Srinivas et al., 2010] or GP-TS [Chowdhury and Gopalan, 2017]. For each task $f_s$, the BASEBO agent is given the kernel $\hat{k}_{s-1}$ meta-learned on the first $s - 1$ tasks. Equipped with this kernel, BASEBO interacts with the current bandit environment, aiming to optimize its payoff by balancing exploration and exploitation.
In the lifelong setting, we not only have to explore for the sake of optimizing the current reward function $f_s$, but we also need to make sure that the sequence of action–reward pairs will be sufficiently informative (in the sense of Assumption 3.2) for meta-learning $\hat{k}_s$ in the next stage. To this end, LIBO forces the base agent to select purely exploratory actions for the first $n_s$ steps of the task, by i.i.d. sampling from the uniform distribution on $\mathcal{X}$. Following Basu et al. [2021], we refer to this as forced exploration and use $D^{\mathrm{exp}}_s := \{(x_{s,i}, y_{s,i}),\, i \le n_s\}$ to refer to the collected exploratory data of task $f_s$. We use a decreasing sequence $(n_1, \dots, n_m)$ as detailed below, since less exploration by BASEBO will be required once more multi-task data is collected. For steps $i > n_s$, BASEBO selects actions according to its normal bandit policy. After the agent has interacted with the current task for $n$ steps, we pass the exploratory data $D^{\mathrm{exp}}_{1:s}$ to META-KGL to meta-learn $\hat{k}_s$. We then announce this new kernel estimate to the BASEBO agent for solving the next task $s + 1$. Figure 1 visualizes this process and Algorithm 2 summarizes LIBO.

[Figure 1: Overview of LIBO. BASEBO ($i > n_s$) and forced exploration ($i \le n_s$) interact with the environment through actions $x_{s,i}$ and rewards $f_s(x_{s,i}) + \epsilon_{s,i}$; the exploratory datasets $D^{\mathrm{exp}}_1, \dots, D^{\mathrm{exp}}_{s-1}$ are passed to META-KGL, which returns the kernel estimate $\hat{k}_{s-1}$ to BASEBO.]
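Putting the pieces together, a high-level sketch of the LIBO loop could look as follows. The BASEBO interface, the one-dimensional uniform domain, and the call signatures are assumptions made for illustration; Algorithm 2 in the paper is the authoritative procedure.

```python
import numpy as np

def libo(tasks, base_bo_factory, meta_learn_kernel, n, k_init, rng):
    """Sketch of the LIBO outer loop (cf. Algorithm 2, illustrative only).

    tasks             : sequence of reward functions f_1, ..., f_m
    base_bo_factory   : kernel -> BASEBO agent with .select() and .update(x, y)
    meta_learn_kernel : list of exploration datasets -> kernel estimate
    """
    k_hat = k_init                      # e.g., start from k_full
    exp_data = []                       # D_1^exp, ..., D_{s-1}^exp
    for s, f_s in enumerate(tasks, start=1):
        n_s = int(np.ceil(np.sqrt(n) / s ** 0.25))   # decreasing forced-exploration budget
        agent = base_bo_factory(k_hat)               # BASEBO equipped with k_hat_{s-1}
        D_exp = []
        for i in range(1, n + 1):
            if i <= n_s:
                x = rng.uniform(0.0, 1.0)            # forced exploration: uniform on X
            else:
                x = agent.select()                   # normal bandit policy
            y = f_s(x) + 0.1 * rng.standard_normal()
            agent.update(x, y)
            if i <= n_s:
                D_exp.append((x, y))
        exp_data.append(D_exp)
        k_hat = meta_learn_kernel(exp_data)          # meta-learn k_hat_s from D_{1:s}^exp
    return k_hat
```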
4.1 REGRET BOUNDS
Let $R^*(n)$ be the worst-case regret of BASEBO with oracle knowledge of the true kernel $k$ on single tasks when the reward resides in $\mathcal{H}_k$. When employed sequentially on $m$ bandit tasks, the worst-case lifelong regret $R^*(m, n)$ will be of the order $mR^*(n)$ with high probability. We refer to this as the oracle regret, since here BASEBO has access to the true kernel $k$, which does not hold in practice. Since our meta-learned kernels $\hat{k}_s$ are approximations of $k$, the oracle regret is a natural lower bound on the regret of LIBO. In the following, we show that if $R^*(n)$, the single-task oracle regret of the base bandit algorithm, is sublinear (e.g., as for GP-UCB or GP-TS), then so is the lifelong regret $R(m, n)$ of LIBO. Importantly, $R(m, n)$ is not only sublinear in $n$, but also converges with high probability to $R^*(m, n)$. Theorem 4.1 presents this guarantee, assuming that the forced exploration datasets $D^{\mathrm{exp}}_s$ satisfy Assumption 3.2, which META-KGL requires to yield a provably consistent estimator of $k$. Later, in Proposition 4.3, we show that exploration by i.i.d. sampling from a uniform distribution over $\mathcal{X}$ guarantees this assumption.
Theorem 4.1. For all tasks $s = 1, \dots, m$, assume that the reward function $f_s \in \mathcal{H}_k$ has bounded RKHS norm $\|f_s\|_k \le B$. Set the number of forced exploration actions as $n_s = \sqrt{n}/s^{1/4}$, and assume that Assumptions 3.1 and 3.2 hold for the data $D^{\mathrm{exp}}_{1:s}$ for all $s = 1, \dots, m$. Suppose that, with probability greater than $1 - \delta/2$, BASEBO has worst-case oracle regret $R^*(m, n)$. Then, the lifelong regret of LIBO satisfies
$$R(m, n) - R^*(m, n) = \mathcal{O}\Big(\underbrace{B m^{3/4}\sqrt{n}}_{\text{forced exp.}} \;+\; \underbrace{B (nm)^{1/3} \log^{3/4}(mp/\delta)}_{\text{kernel mismatch}}\Big)$$
with probability greater than $1 - \delta$.
The explicit inequality without the $\mathcal{O}$-notation can be found in Appendix E, together with the proof. In the following, we give a sketch of the proof, aiming to explain the source of each term in the bound. For every forced exploration step, in the worst case we suffer regret of $2B$. When accumulated over a total of $\sum_{s=1}^{m} n_s$ such steps, this gives the first term in the bound. If $\hat{k}_s \neq k$, it is possible to suffer linear regret in the worst case. To account for this, we calculate the smallest integer $m_0$ for which, with high probability, $\hat{k}_s = k$ for all $m_0 < s \le m$. Based on Theorem 3.3, we show that $m_0 = \mathcal{O}\big((m/n^2)^{1/3}\log^{3/4}(mp/\delta)\big)$. For every task $s \le m_0$ we may suffer linear regret of $2Bn$, i.e., at most $2Bnm_0$ in total in the worst case. This is upper bounded by the second term in Theorem 4.1, which can be regarded as the cost of learning $J$. Notably, it grows only logarithmically with $p$, the number of considered features/kernels, offering a significant improvement over the polynomial rates given by prior works [e.g., Yang et al., 2021, Hong et al., 2022]. Table 2 and Table 3 present a comprehensive list of the related regret bounds.
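For completeness, the accumulation of the forced-exploration steps into the first term of Theorem 4.1 can be spelled out with a standard integral comparison (a supplementary step, assuming $n_s = \sqrt{n}/s^{1/4}$ as reconstructed above):
$$\sum_{s=1}^{m} n_s \;=\; \sqrt{n}\sum_{s=1}^{m} s^{-1/4} \;\le\; \sqrt{n}\Big(1 + \int_{1}^{m} t^{-1/4}\,\mathrm{d}t\Big) \;\le\; \tfrac{4}{3}\,\sqrt{n}\, m^{3/4},$$
so the forced-exploration steps contribute at most $2B \cdot \tfrac{4}{3}\sqrt{n}\,m^{3/4} = \mathcal{O}(B m^{3/4}\sqrt{n})$ regret.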
We highlight that the excess regret of LIBO in Theorem 4.1 is sublinear in both $m$ and $n$. This implies that the algorithm is oracle optimal, meaning that as $m \to \infty$, the single-task regret without knowledge of $k$ eventually approaches the oracle single-task regret. Recall that $R^*(m, n) = mR^*(n)$ and therefore $R(m, n)/m \to R^*(n)$. This guarantee is stronger than that of Basu et al. [2021] and Peleg et al. [2022], where the excess regret depends linearly on $m$ due to excessive forced exploration. By decreasing $n_s \propto s^{-1/4}$, the number of exploratory steps vanishes throughout the sequence of tasks.
As an example, we analyze the performance when GP-UCB¹ [Srinivas et al., 2010] is used as the BASEBO algorithm. In this case, we demonstrate that the worst-case lifelong regret of LIBO is of the same rate as the corresponding oracle regret. To highlight the benefit of this oracle optimality, we compare to a naive baseline which uses $\hat{k}_s = k_{\mathrm{full}} = \sum_{j=1}^{p} \frac{1}{p} k_j$ for all tasks instead of meta-learning $\hat{k}_s$ sequentially. In particular, we consider solving a sequence of $m$ tasks in three scenarios: 1) running LIBO paired with GP-UCB, 2) repeatedly running GP-UCB with oracle access to $k$, and 3) repeatedly running GP-UCB with $k_{\mathrm{full}}$. The following corollary shows that the worst-case upper bounds for the first two scenarios match in $\mathcal{O}$-notation.
Corollary 4.2 (Lifelong GP-UCB). Consider the setting of Theorem 4.1 with GP-UCB as the BASEBO agent. Then, with probability at least $1 - \delta$, the lifelong regret of LIBO paired with GP-UCB satisfies
$$R(m, n) = \mathcal{O}\big(R^*(m, n)\big) = \mathcal{O}\Big(Bm\sqrt{d^*\, n \log\tfrac{n}{d^*}} + m\sqrt{n d^* \log\tfrac{n}{d^*}\,\log\tfrac{1}{\delta}}\Big)$$
where $d^* := \sum_{j \in J} d_j$.
¹Appendix E.1 provides a background on GP-UCB.
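To make the role of the base agent concrete, here is a generic GP-UCB acquisition step that could serve as BASEBO with the meta-learned kernel $\hat{k}_s$. The candidate grid, the fixed exploration coefficient beta, and the noise level are illustrative assumptions; they are not the constants used in the paper's analysis.

```python
import numpy as np

def gp_ucb_step(k_hat, X_obs, y_obs, candidates, beta=2.0, noise=0.01):
    """One GP-UCB step: pick the argmax of posterior mean + sqrt(beta) * posterior std.

    k_hat      : kernel function k_hat(x, x')
    X_obs      : list of previously selected actions
    y_obs      : array of corresponding rewards
    candidates : array of candidate actions to score
    """
    if len(X_obs) == 0:
        return candidates[np.random.randint(len(candidates))]
    K = np.array([[k_hat(a, b) for b in X_obs] for a in X_obs])
    K_inv = np.linalg.inv(K + noise * np.eye(len(X_obs)))
    best, best_ucb = None, -np.inf
    for x in candidates:
        k_x = np.array([k_hat(x, a) for a in X_obs])
        mean = k_x @ K_inv @ np.asarray(y_obs)           # posterior mean
        var = max(k_hat(x, x) - k_x @ K_inv @ k_x, 0.0)  # posterior variance
        ucb = mean + np.sqrt(beta) * np.sqrt(var)
        if ucb > best_ucb:
            best, best_ucb = x, ucb
    return best
```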
In the third scenario, we conservatively set $\hat{k}_s = k_{\mathrm{full}}$ for all $s = 1, \dots, m$. While this is sufficient for attaining a lifelong regret that is sublinear in $n$, the performance will not be oracle optimal. In particular, this algorithm suffers from a regret of
$$R(m, n) = \mathcal{O}\Big(Bm\sqrt{\tfrac{dp}{|J|}\, n \log\tfrac{n}{d}} + \sqrt{n d \log\tfrac{n}{d}\,\log\tfrac{1}{\delta}}\Big)$$
where $d = \sum_{j=1}^{p} d_j \ge d^*$ and $p/|J|$ can be very large. Our experiments confirm that the performance of the naive approach is significantly worse than that of the other variants. This is because confidence bounds constructed using $k_{\mathrm{full}}$ tend to contract more slowly than those constructed with the sparse meta-learned $\hat{k}_s$.
4.2 FORCED EXPLORATION
Our forced exploration scheme ensures that the collected data is sufficiently informative to guarantee successful meta-learning. From a technical perspective, it ensures that Assumption 3.2 is met and allows for a consistent estimator of $k$. The cost of this exploration in the regret of each task is smaller in rate than the minimax regret bound [Lattimore and Szepesvári, 2020]. Therefore it has only a negligible effect on the overall performance guarantees of LIBO (see Corollary 4.2). We show that by uniformly drawing actions from the domain, the collected data satisfies this assumption:
Proposition 4.3. Assume that $\phi_j \in L^2(\mathcal{X})$, $j \in \{1, \dots, p\}$, are orthonormal and let $d_j = 1$. Draw $x_1, \dots, x_{n_1}$ independently and uniformly from $\mathcal{X}$, and repeatedly use them to construct $D^{\mathrm{exp}}_{1:s}$. Then with probability at least $1 - \delta$, $D^{\mathrm{exp}}_{1:s}$ satisfies Assumption 3.2, for $s = 1, \dots, m$.
The proof can be found in Appendix E.3. The $d_j = 1$ condition is met without loss of generality by splitting the higher-dimensional feature maps and introducing more base features, which will increase $p$. Moreover, the orthonormality condition is met by orthogonalizing and re-scaling the feature maps. Basis functions such as Legendre polynomials and Fourier features [Rahimi et al., 2007] satisfy these conditions.
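The informativeness of uniformly drawn exploration data can also be checked numerically. The sketch below draws uniform actions, builds the feature matrix for an orthonormal cosine basis ($d_j = 1$), and reports the smallest eigenvalue of the empirical covariance $\frac{1}{n}\Phi^\top\Phi$; this being bounded away from zero is a sufficient condition for Assumption 3.2. The feature choice, the domain, and the eigenvalue check are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_exp = 10, 200

# Forced exploration: actions drawn i.i.d. and uniformly from X = [0, 1].
x_exp = rng.uniform(0.0, 1.0, size=n_exp)

# Orthonormal 1-d feature maps (d_j = 1): sqrt(2) * cos(j * pi * x) is an
# orthonormal family in L^2([0, 1]) for j = 1, ..., p.
Phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(x_exp, np.arange(1, p + 1)))

cov = Phi.T @ Phi / n_exp            # empirical feature covariance (p x p)
print("smallest eigenvalue:", np.linalg.eigvalsh(cov).min())
```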
Generally, it is natural to require BASEBO to explore more in the lifelong setting than when it is used in isolation with a known kernel. We observe in our experiments that LIBO has better empirical performance with forced exploration (i.e., $n_s > 0$) than without. This additional exploration is also required in the representation learning [Yang et al., 2021, 2022, Cella and Pontil, 2021, Cella et al., 2022] and hierarchical Bayesian bandit literature [Basu et al., 2021, Peleg et al., 2022, Hong et al., 2022], where it is assumed that either the context distribution or the chosen actions are diverse enough. In the case of contextual bandits, if there is sufficient randomness, the BASEBO can