
the parameters $(B, \rho)$ are related to the power of the oracle and $n$ is the number of linear constraints. The case $B = 1$ coincides with known results.
Experimental results: We benchmark both FPML and OGhybrid on an online black-box hyperparameter optimization problem based on the 2020 NeurIPS BBO challenge [Turner et al., 2021]. We find that both of these new algorithms outperform OG for various compute budgets. We are able to explain why this happens for this specific dataset, and discuss the scenarios under which each algorithm would perform better.
Techniques: Minimizing $R^*_T$ is an important subroutine for a large variety of applications including Linear Programming, Boosting, and solving zero-sum games [Arora et al., 2012]. Traditionally an experts algorithm such as Hedge [Littlestone and Warmuth, 1994], which pulls a single arm per round, is used as a subroutine to minimize $R^*_T$. We highlight how, in the cases of OSFM and Linear Programming, one can simply replace a single-arm $R^*_T$-minimizing subroutine with FPML and get performance bounds with little or no alteration to the original proofs. The resulting algorithms have improved bounds (due to improved bounds on $R^*_T$ when $B > 1$) at the cost of qualitatively changing the algorithm (e.g. requiring a larger budget or a more powerful oracle). This is significant because it highlights how bounds on $R^*_T$ when $B > 1$ can lead to new results in other application areas. In Section 2.1 we also highlight how the proof techniques of Kalai and Vempala [2005] for bounding $R^*_T$ in the traditional experts setting can naturally be generalized to the case when $B > 1$, which is of independent interest.
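To make the drop-in replacement concrete, the sketch below contrasts a single-arm Hedge subroutine with an FPML-style subroutine that pulls the $B$ perturbed leaders in each round and is charged the minimum cost over the pulled set. This is a minimal illustration only: the exponential perturbations, the learning rates, and the helper names hedge and fpml_sketch are our assumptions for exposition, not the exact algorithm or parameters analyzed in this paper.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def hedge(cost_matrix, eta=0.1):
    # Classic single-arm experts subroutine: keep a weight per arm,
    # sample one arm per round, full-information multiplicative update.
    # cost_matrix has shape (T, N) with per-round costs in [0, 1].
    T, N = cost_matrix.shape
    log_w = np.zeros(N)
    total_cost = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        arm = rng.choice(N, p=p)
        total_cost += cost_matrix[t, arm]
        log_w -= eta * cost_matrix[t]  # downweight costly arms
    return total_cost

def fpml_sketch(cost_matrix, B, eta=1.0):
    # FPML-style subroutine (our sketch): perturb the cumulative costs and
    # pull the B arms that look best (the "perturbed leaders"); the round's
    # cost is the minimum cost over the pulled set.
    T, N = cost_matrix.shape
    cum_cost = np.zeros(N)
    total_cost = 0.0
    for t in range(T):
        noise = rng.exponential(scale=eta, size=N)
        pulled = np.argsort(cum_cost - noise)[:B]   # B smallest perturbed costs
        total_cost += cost_matrix[t, pulled].min()  # pay the best pulled arm
        cum_cost += cost_matrix[t]
    return total_cost
\end{verbatim}

In a reduction such as the LP or boosting applications mentioned above, only this subroutine changes: it now returns a set of $B$ arms per round rather than a single arm, and the surrounding analysis is charged the minimum cost over that set.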
1.2 Relation to prior work
One can alternatively formulate the problem we consider as receiving the maximum reward $r_t(a) = 1 - c_t(a)$ of each arm chosen, instead of the minimum cost. In this maximum-of-rewards formulation,
the problem fits within the OSFM framework where (a) all actions are unit-time and (b) the submodular job function is always a maximum of rewards. The rewards formulation of the problem has also been separately studied as the K-MAX problem (here $K = B$) [Chen et al., 2016a]. In the OSFM setting, Streeter and Golovin [2008] give an online greedy approximation algorithm which guarantees $\mathbb{E}[(1 - e^{-1})\,\mathrm{OPT}(B) - \mathrm{Reward}_T] \le O(\sqrt{TB\ln(N)})$ in the full-feedback adversarial setting, where $\mathrm{OPT}(B)$ is the cumulative reward of the best fixed subset of $B$ arms in hindsight, and $\mathrm{Reward}_T$ is the cumulative reward of the algorithm. A similar bound of $O(B\sqrt{TN\ln(N)})$ can be given in a semi-bandit feedback setting. Conversely, in the full-feedback setting, Streeter and Golovin [2007] show that any algorithm has worst-case regret $\mathbb{E}[\mathrm{OPT}(B) - \mathrm{Reward}_T] \ge \Omega(\sqrt{TB\ln(N/B)})$ when one receives the maximum of rewards in each round. Chen et al. [2016a] study the K-MAX problem and other non-linear reward functions in the stochastic combinatorial multi-armed bandit setting. Assuming the rewards satisfy certain distributional assumptions, they give an algorithm which achieves distribution-independent regret bounds of $\mathbb{E}[(1 - \varepsilon)\,\mathrm{OPT}(B) - \mathrm{Reward}_T] \le O(\sqrt{TBN\ln(T)})$ for $\varepsilon > 0$ with semi-bandit feedback. Note, however, that we consider the adversarial setting in this paper.
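For concreteness, the cost and reward formulations are linked by an elementary identity. Writing $S_t$ for the set of arms pulled in round $t$ (the notation $S_t$ is ours, introduced only for this display),
\[
\min_{a \in S_t} c_t(a) \;=\; 1 - \max_{a \in S_t} r_t(a) \qquad \text{where } r_t(a) = 1 - c_t(a),
\]
so minimizing the cumulative minimum cost is equivalent to maximizing the cumulative maximum reward, which is exactly the K-MAX objective with $K = B$.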
More broadly, these problems fall within the combinatorial online learning setting, where an algorithm may pull a subset of arms in each round. Much prior work has focused on combinatorial bandits where the reward is linear in the subset of arms chosen, which can model applications including online advertising and online shortest paths [Cesa-Bianchi and Lugosi, 2012, Audibert et al., 2014, Combes et al., 2015]. The case of non-linear rewards is comparatively less studied, but having non-linear rewards (such as max) allows one to model a wider variety of problems, including online expected utility maximization [Li and Deshpande, 2011, Chen et al., 2016a]. As examples of prior work in the stochastic setting, Gopalan et al. [2014] use Thompson Sampling to deal with non-linear rewards that are functions of subsets of arms (including the max function), but require the rewards to come from a known parametric distribution. Chen et al. [2016b] consider a model where the subset of arms pulled is randomized based on pulling a ‘super-arm’, and the reward is a non-linear function of the values of the arms pulled. In the adversarial setting, Han et al. [2021] study the combinatorial MAB problem when rewards can be expressed as a $d$-degree polynomial.
In contrast to prior work, which focuses on giving algorithms that compete against benchmarks with the same budget as the algorithm, this work is concerned with the trade-off between regret bounds and budget size. We focus on giving regret bounds against $\mathrm{OPT}(1)$, and we use this result in Section 3 to get regret bounds against $\mathrm{OPT}(B_0)$ for $B_0 < B$ in OSFM. Decoupling the regret