
The explicit inequality without the $O$-notation can be found in Appendix E, together with the proof. In the following, we give a sketch of the proof, aiming to explain the source of each term in the bound. For every forced exploration step, in the worst case, we suffer a regret of $2B$. Accumulated over a total of $\sum_{s=1}^{m} n_s$ such steps, this gives the first term in the bound. If $\hat{k}_s \neq k^\star$, it is possible to suffer linear regret in the worst case. To account for this, we calculate the smallest integer $m_0$ for which, with high probability, $\hat{k}_s = k^\star$ for all $m_0 < s \leq m$. Based on Theorem 3.3, we show that $m_0 = O\big((m/n^2)^{1/3}\log^{3/4}(mp/\delta)\big)$. For every task $s \leq m_0$ we may suffer regret that is linear in $n$, i.e., up to $2Bn$, so these tasks contribute at most $2Bn m_0$ in the worst case. This is upper bounded by the second term in Theorem 4.1, which can be regarded as the cost of learning $J^\star$. Notably, it grows only logarithmically with $p$, the number of considered features/kernels, offering a significant improvement over the polynomial rates given by prior works [e.g., Yang et al., 2021, Hong et al., 2022]. Table 2 and Table 3 present a comprehensive list of the related regret bounds.
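Schematically, the contributions discussed above combine into the bound (the precise constants and logarithmic factors are given in Appendix E):
$$R(m, n) \;\le\; \underbrace{2B \sum_{s=1}^{m} n_s}_{\text{forced exploration}} \;+\; \underbrace{2 B n\, m_0}_{\text{tasks with } \hat{k}_s \neq k^\star} \;+\; R^\star(m, n), \qquad m_0 = O\Big(\big(\tfrac{m}{n^2}\big)^{1/3} \log^{3/4}\big(\tfrac{mp}{\delta}\big)\Big).$$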
We highlight that the excess regret of LIBO in Theorem 4.1 is sublinear in both $m$ and $n$. This implies that the algorithm is oracle optimal, meaning that as $m \to \infty$, the single-task regret without knowledge of $k^\star$ eventually approaches the oracle single-task regret. Recall that $R^\star(m, n) = m R^\star(n)$ and therefore $R(m, n)/m \to R^\star(n)$. This guarantee is stronger than that of [Basu et al., 2021, Peleg et al., 2022], where the excess regret depends linearly on $m$ due to excessive forced exploration. By decreasing $n_s \propto s^{-1/4}$, the number of exploratory steps vanishes throughout the sequence of tasks.
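To make this last point concrete, take for instance $n_s = n_1 s^{-1/4}$ (ignoring rounding to integers; $n_1$ is the exploration budget of the first task). Then
$$\sum_{s=1}^{m} n_s \;=\; n_1 \sum_{s=1}^{m} s^{-1/4} \;\le\; \tfrac{4}{3}\, n_1\, m^{3/4},$$
so the average number of exploratory steps per task is $O(n_1 m^{-1/4})$ and vanishes as $m \to \infty$, while the corresponding forced-exploration term of the regret bound, $2B\sum_{s=1}^{m} n_s$, remains sublinear in $m$.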
As an example, we analyze the performance when GP-UCB¹ [Srinivas et al., 2010] is used as the BASEBO algorithm. In this case, we demonstrate that the worst-case lifelong regret of LIBO is of the same rate as the corresponding oracle regret. To highlight the benefit of this oracle optimality, we compare to a naive baseline which uses $\hat{k}_s = k_{\text{full}} = \sum_{j=1}^{p} \frac{1}{p} k_j$ for all tasks instead of meta-learning $\hat{k}_s$ sequentially. In particular, we consider solving a sequence of $m$ tasks in three scenarios: 1) running LIBO paired with GP-UCB, 2) repeatedly running GP-UCB with oracle access to $k^\star$, and 3) repeatedly running GP-UCB with $k_{\text{full}}$. The following corollary shows that the worst-case upper bounds for the first two scenarios match in $O$-notation.
Corollary 4.2 (Lifelong GP-UCB). Consider the setting of Theorem 4.1 with GP-UCB as the BASEBO agent. Then, with probability at least $1-\delta$, the lifelong regret of LIBO paired with GP-UCB satisfies
$$R(m, n) = O\big(R^\star(m, n)\big) = O\Big(B m d^\star \sqrt{n}\,\log\tfrac{n}{d^\star} \;+\; m\sqrt{n d^\star \log\tfrac{n}{d^\star}\log\tfrac{1}{\delta}}\Big),$$
where $d^\star := \sum_{j \in J^\star} d_j$.
¹Appendix E.1 provides a background on GP-UCB.
In the third scenario, we conservatively set $\hat{k}_s = k_{\text{full}}$ for all $s = 1, \dots, m$. While this is sufficient for attaining a lifelong regret that is sublinear in $n$, the performance will not be oracle optimal. In particular, this algorithm suffers from a regret of
$$R(m, n) = O\Big(B m d \sqrt{\tfrac{p}{|J^\star|}}\, \sqrt{n}\,\log\tfrac{n}{d} \;+\; m\sqrt{n d \log\tfrac{n}{d}\log\tfrac{1}{\delta}}\Big),$$
where $d = \sum_{j=1}^{p} d_j \gg d^\star$ and $p/|J^\star|$ can be very large.
Our experiments confirm that the performance of the naive approach is significantly worse than that of the other variants. This is because confidence bounds constructed using $k_{\text{full}}$ tend to contract more slowly than those constructed with the sparse meta-learned $\hat{k}_s$.
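As an illustration of this effect, the following sketch (hypothetical and self-contained; the Fourier feature construction, the noise level, and the sizes $p = 40$ and $|J^\star| = 3$ are our own illustrative choices, not values from the paper) compares GP posterior standard deviations, which drive the width of UCB confidence bounds, under the full kernel $k_{\text{full}}$ over all $p$ base features and under a sparse kernel restricted to a few features, given the same uniformly sampled data.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(x, p):
    # First p elements of the Fourier basis on [0, 1] (orthonormal in L2([0, 1])).
    feats = [np.ones_like(x)]
    j = 1
    while len(feats) < p:
        feats.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        if len(feats) < p:
            feats.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
        j += 1
    return np.stack(feats, axis=1)  # shape (len(x), p)

def posterior_std(phi_train, phi_test, noise=0.1):
    # GP posterior std for the averaged linear kernel k(x, x') = phi(x)^T phi(x') / p,
    # mirroring k_full = (1/p) * sum_j k_j with one-dimensional base kernels k_j = phi_j phi_j.
    p = phi_train.shape[1]
    K = phi_train @ phi_train.T / p + noise**2 * np.eye(phi_train.shape[0])
    k_cross = phi_train @ phi_test.T / p
    k_diag = np.einsum("ij,ij->i", phi_test, phi_test) / p
    var = k_diag - np.einsum("ij,ij->j", np.linalg.solve(K, k_cross), k_cross)
    return np.sqrt(np.clip(var, 0.0, None))

x_train = rng.uniform(0.0, 1.0, size=15)   # the same uniformly sampled data for both kernels
x_test = np.linspace(0.0, 1.0, 200)

p_full, J_star = 40, [0, 1, 2]             # p >> |J*|; both values are illustrative
phi_train = fourier_features(x_train, p_full)
phi_test = fourier_features(x_test, p_full)

std_full = posterior_std(phi_train, phi_test)                          # width under k_full
std_sparse = posterior_std(phi_train[:, J_star], phi_test[:, J_star])  # width under a sparse kernel
print(f"mean posterior std -- k_full: {std_full.mean():.3f}, sparse: {std_sparse.mean():.3f}")
```

With far fewer observations than base features, the posterior standard deviation under the full kernel stays large over most of the domain, whereas the sparse kernel's uncertainty is essentially reduced to the noise level, which is the contraction gap described above.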
4.2 FORCED EXPLORATION
Our forced exploration scheme ensures that the collected data is sufficiently informative to guarantee successful meta-learning. From a technical perspective, it ensures that Assumption 3.2 is met and allows for a consistent estimator of $k^\star$. The cost of this exploration in the regret of each task is smaller in rate than the minimax regret bound [Lattimore and Szepesvári, 2020]. Therefore, it has only a negligible effect on the overall performance guarantees of LIBO (see Corollary 4.2). We show that by uniformly drawing actions from the domain, the collected data satisfies this assumption:
Proposition 4.3. Assume that $\phi_j \in L^2(\mathcal{X})$, $j \in \{1, \dots, p\}$, are orthonormal and let $d_j = 1$. Draw $x_1, \dots, x_{n_1}$ independently and uniformly from $\mathcal{X}$, and repeatedly use them to construct $D^{\mathrm{exp}}_{1:s}$. Then with probability at least $1-\delta$, $D^{\mathrm{exp}}_{1:s}$ satisfies Assumption 3.2 for $s = 1, \dots, m$.
The proof can be found in Appendix E.3. The $d_j = 1$ condition is met without loss of generality by splitting the higher-dimensional feature maps and introducing more base features, which will increase $p$. Moreover, the orthonormality condition is met by orthogonalizing and re-scaling the feature maps. Basis functions such as Legendre polynomials and Fourier features [Rahimi et al., 2007] satisfy these conditions.
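As an illustrative sketch of these conditions (the choices $p = 10$, $n_1 = 200$, the domain $[-1, 1]$, and the normalization are ours, and the check below is a stand-in for the data-diversity property that Assumption 3.2 encodes, not its literal statement), one can draw the exploration inputs uniformly, evaluate rescaled Legendre features, and verify that their empirical covariance is well conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)

p, n1 = 10, 200                              # number of base features and exploration inputs (illustrative)
x_exp = rng.uniform(-1.0, 1.0, size=n1)      # drawn once, uniformly, then reused for every task

# Legendre polynomials P_0, ..., P_{p-1}, rescaled so that each phi_j = sqrt(2j + 1) * P_j
# is orthonormal with respect to the uniform distribution on [-1, 1] (and d_j = 1).
Phi = np.polynomial.legendre.legvander(x_exp, p - 1) * np.sqrt(2 * np.arange(p) + 1)

# Orthonormality gives E[phi(x) phi(x)^T] = I for x ~ Unif([-1, 1]), so the empirical feature
# covariance of the uniformly drawn exploration inputs should be well conditioned.
emp_cov = Phi.T @ Phi / n1
print("smallest eigenvalue of the empirical feature covariance:",
      np.round(np.linalg.eigvalsh(emp_cov).min(), 3))
```

A smallest eigenvalue bounded away from zero indicates that the uniformly drawn exploration points excite every base feature, which is exactly the kind of richness that makes a consistent estimate of $k^\star$ possible.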
Generally, it is natural to require BASEBO to explore more in the lifelong setting compared to when it is used in isolation and with a known kernel. We observe in our experiments that LIBO achieves better empirical performance with forced exploration (i.e., $n_s > 0$) than without. This additional exploration is also required in the representation learning [Yang et al., 2021, 2022, Cella and Pontil, 2021, Cella et al., 2022] and hierarchical Bayesian bandit literature [Basu et al., 2021, Peleg et al., 2022, Hong et al., 2022], where it is assumed that either the context distribution or the chosen actions are diverse enough. In the case of contextual bandits, if there is sufficient randomness, the BASEBO can