over a time horizon $h$ by considering the influence of the choice of $x_{t:t+h-1}$ on $\tilde{f}_{t+h}$ through multi-step lookahead [8, 9, 10]. Such methods promise improved query efficiency – needing fewer configuration evaluations to achieve threshold performance – and improved final performance once all evaluations are considered, when compared to myopic approaches. As recent ML models such as pre-trained language and vision models become larger, and thus both more expensive and more time-consuming to train [11, 12, 13], these promises have become all the more enticing.
Despite these promises, planning-based HPO has not been widely adopted. This is because suggested approaches have been both technically complex to implement and computationally intensive to execute. In this paper, we circumvent both challenges by leveraging recent advances in Transformer-based [14], natural-language-interfaced HPO. Specifically, Chen et al. [2] recently proposed an approach dubbed OptFormer, which presents a unified, natural language interface for hyperparameter optimization and thus allows transfer learning between HPO problems. Relevant to us is the structure of the OptFormer model: it jointly performs both hyperparameter suggestion $x_t$ and function prediction $\tilde{f}_t(x_t)$ as autoregressive generation.
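To illustrate this structure, the sketch below frames both operations as sampling from a single autoregressive model conditioned on a serialized trial history. The `OptFormerLike` class and its `suggest`/`predict` methods are hypothetical stand-ins of our own naming, not the released OptFormer API:

```python
# A minimal, hypothetical stand-in for the joint suggest/predict interface;
# NOT the released OptFormer API, only an illustration of how hyperparameter
# suggestion and function prediction can both be framed as sampling from one
# autoregressive model conditioned on the trial history.
import random

class OptFormerLike:
    def suggest(self, history):
        """Sample the next configuration x_t given the serialized history."""
        return {"lr": 10 ** random.uniform(-5, -1)}  # toy stand-in

    def predict(self, history, x):
        """Sample a predicted outcome f~_t(x), again by autoregressive
        generation conditioned on the history plus the candidate x."""
        return random.gauss(0.8, 0.05)  # toy stand-in

model = OptFormerLike()
history = []                       # (x, y) pairs observed so far
x_t = model.suggest(history)       # generation pass 1: propose x_t
y_t = model.predict(history, x_t)  # generation pass 2: predict f~_t(x_t)
```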
By leveraging the ease and speed of autoregressive generation with OptFormer, we present a novel planning-based HPO algorithm that is simple and scalable, thus bypassing the typical challenges of traditional methods. We incorporate multi-step lookahead into both candidate generation (instead of the global search typical of most HPO approaches) and candidate evaluation (we provide a planning-based acquisition function $\alpha$), since performing rollouts with OptFormer is simply multi-step generation for $h$ timesteps. In Section 4 we present our method in detail, and in Section 6 we empirically validate our approach in two settings: the BBOB dataset [15], a synthetic benchmark, and the RealWorldData dataset [2], which is constructed by aggregating actual hyperparameter tuning experiments. Our results highlight that incorporating planning into Transformer-based hyperparameter optimization models is a promising research direction, and we invite the community to explore this setting further.
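As a rough illustration of such a rollout-based acquisition function, the sketch below scores a candidate $x$ by simulating $h$ steps of future suggestions and predictions, averaging the best simulated outcome over several rollouts. It reuses the hypothetical `suggest`/`predict` interface sketched above; the names and the specific scoring rule are our assumptions, not the method's exact definition:

```python
# A Monte Carlo sketch of a rollout-style acquisition alpha(x): simulate h
# lookahead steps by autoregressive generation and score x by the best
# fantasized outcome. Hypothetical illustration, not the paper's exact code.
def rollout_acquisition(model, history, x, h=3, n_rollouts=8):
    total = 0.0
    for _ in range(n_rollouts):
        sim = list(history)
        y = model.predict(sim, x)           # imagined outcome for x
        sim.append((x, y))
        best = y
        for _ in range(h - 1):              # remaining lookahead steps
            x_next = model.suggest(sim)     # imagined future suggestion
            y_next = model.predict(sim, x_next)
            sim.append((x_next, y_next))
            best = max(best, y_next)
        total += best                       # best value in this rollout
    return total / n_rollouts               # estimate of alpha(x)
```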
2 Related Work
As already discussed, most approaches to hyperparameter optimization leverage Bayesian optimization (BO). These methods typically build a Gaussian process (GP) model [5, 16] as the surrogate $\tilde{f}_t(x)$ and proceed to suggest $x_t \approx \operatorname{argmax}_{x \in \mathcal{X}} \alpha(x; \tilde{f}_t)$, where $\alpha$ is an acquisition function. An acquisition function measures the expected utility after one additional evaluation of $f$ at $x$. Past work has explored a variety of acquisition functions, such as probability of improvement (PI) [5], expected improvement (EI) [17, 18], and upper confidence bound (UCB) [19, 20]. These acquisition functions are myopic since they depend solely on the surrogate at the current timestep and so do not directly account for the impact of a greedy choice on downstream evaluations.
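For concreteness, here is a minimal sketch of one such myopic BO step, fitting a GP surrogate with scikit-learn and maximizing EI over a candidate grid; the toy objective and the grid search are our simplifications:

```python
# A minimal sketch of a myopic BO step: fit a GP surrogate, then pick
# x_t ~ argmax_x EI(x) over a candidate grid (maximization convention).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_cand, y_best):
    mu, std = gp.predict(X_cand, return_std=True)
    std = np.maximum(std, 1e-9)                # avoid division by zero
    z = (mu - y_best) / std
    return (mu - y_best) * norm.cdf(z) + std * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (5, 1))                  # configurations tried so far
y = np.sin(6 * X[:, 0])                        # their observed scores (toy f)
gp = GaussianProcessRegressor().fit(X, y)      # GP surrogate f~_t
X_cand = np.linspace(0, 1, 200)[:, None]       # candidate grid over X
x_t = X_cand[np.argmax(expected_improvement(gp, X_cand, y.max()))]
```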
Unlike myopic HPO methods, planning-based approaches fundamentally require building models of the future to assess the impact of a current decision on later timesteps. Though these methods also rely on a GP as a surrogate model, each point in multi-step planning involves fantasizing/imagining an updated GP posterior $\tilde{f}_{t+1} \mid \tilde{x}_t, \; \ldots, \; \tilde{f}_{t+h} \mid \tilde{x}_t, \tilde{x}_{t+1}, \ldots, \tilde{x}_{t+h-1}$ based on simulated choices from lookaheads $\{(\tilde{x}_t, \tilde{y}_t), \ldots, (\tilde{x}_{t+h-1}, \tilde{y}_{t+h-1})\}$ [21, 22]. Note that we use $\tilde{x}_t$ to represent a fantasized decision, while $x_t$ is the actual choice made at timestep $t$. While multi-step planning is promising, constructing the posterior of a GP model requires matrix inversion, which is a compute-intensive operation [23]. Even outside of this limitation, traditional planning-based approaches are compute-intensive due to (i) the poor scaling behavior of the search tree, which is $O(q^h)$ where $q$ is the number of choices at each decision point for each lookahead step [9, 21], forcing most methods to explore short horizons, typically $h \in \{1, 2\}$; and (ii) nested expectation and maximization: marginalizing the future observations $\tilde{y}_{t+j}, \; j < h$, and globally searching the acquisition function to obtain the query $\tilde{x}_{t+j}$ at every lookahead step.
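The sketch below illustrates this fantasized lookahead tree: at each of $h$ steps it branches over $q$ candidates, draws one fantasized observation from the current GP posterior, refits the GP, and recurses, visiting $O(q^h)$ nodes. Using a single fantasy sample per branch, and a max in place of the nested expectation, is our simplification to keep the example short:

```python
# An illustrative q^h fantasized-lookahead tree (simplified): each recursion
# level branches over q candidates, samples one fantasy observation from the
# GP posterior, conditions an updated GP on it, and recurses.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def lookahead_value(X, y, candidates, h, rng):
    """Best fantasized outcome reachable within h steps; visits O(q^h) nodes."""
    if h == 0:
        return y.max()
    gp = GaussianProcessRegressor().fit(X, y)  # posterior update (matrix inversion)
    best = -np.inf
    for x in candidates:                       # branch over q choices
        mu, std = gp.predict(x[None, :], return_std=True)
        y_fant = mu[0] + std[0] * rng.standard_normal()  # fantasized observation
        val = lookahead_value(np.vstack([X, x[None, :]]),  # condition on (x~, y~)
                              np.append(y, y_fant),
                              candidates, h - 1, rng)
        best = max(best, val)
    return best
```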
Another issue is the complexity of implementation. In an attempt to reduce the computational burdens described above, complex approaches have been suggested that require significant expertise to execute properly. For example, Jiang et al. [22] attempt to overcome the problem of nested expectation and maximization by leveraging the reparameterization trick [24] to couple sampling and optimization. They build a differentiable objective over a multi-step tree and then optimize this objective to obtain a proposal for the next suggestion $x_t$; a far from simple method.
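The core reparameterization idea can be sketched briefly: writing a fantasized observation as $\tilde{y} = \mu(x) + \sigma(x)\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$ makes $\tilde{y}$ differentiable in $x$, so an objective built from such samples can be optimized by gradient ascent. The toy mean and standard deviation functions below are our own, purely for illustration:

```python
# A minimal sketch of the reparameterization trick in this context: gradients
# flow through sampled fantasy observations to the query x being optimized.
import torch

def mu(x):    return torch.sin(3 * x)       # stand-in posterior mean
def sigma(x): return 0.1 + 0.05 * x ** 2    # stand-in posterior std

x = torch.tensor(0.5, requires_grad=True)   # candidate query to optimize
eps = torch.randn(64)                       # fixed base noise samples
y_fant = mu(x) + sigma(x) * eps             # reparameterized fantasies
objective = y_fant.mean()                   # e.g. expected fantasy value
objective.backward()                        # differentiates through sampling
print(x.grad)                               # gradient to update x with
```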
Rollout policies [25] provide a tractable alternative for approximating the nested expectation by