Monte Carlo Tree Search based Variable Selection
for High Dimensional Bayesian Optimization
Lei Song∗, Ke Xue∗, Xiaobin Huang, Chao Qian†
State Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210023, China
{songl, xuek, huangxb, qianc}@lamda.nju.edu.cn
Abstract
Bayesian optimization (BO) is a class of popular methods for expensive black-box
optimization, and has been widely applied to many scenarios. However, BO suffers
from the curse of dimensionality, and scaling it to high-dimensional problems is still
a challenge. In this paper, we propose a variable selection method MCTS-VS based
on Monte Carlo tree search (MCTS), to iteratively select and optimize a subset of
variables. That is, MCTS-VS constructs a low-dimensional subspace via MCTS and
optimizes in the subspace with any BO algorithm. We give a theoretical analysis of
the general variable selection method to reveal how it can work. Experiments on
high-dimensional synthetic functions and real-world problems (i.e., NAS-bench
problems and MuJoCo locomotion tasks) show that MCTS-VS equipped with a
proper BO optimizer can achieve state-of-the-art performance.
1 Introduction
In many real-world tasks such as neural architecture search (NAS) [41] and policy search in reinforcement learning (RL) [6], one often needs to solve expensive black-box optimization problems. Bayesian optimization (BO) [2, 11, 23, 32] is a sample-efficient algorithm for solving such problems.
It iteratively fits a surrogate model, typically a Gaussian process (GP), and maximizes an acquisition
function to obtain the next point to evaluate. While BO has been employed in a wide variety of
settings, successful applications are often limited to low-dimensional problems.
Recently, scaling BO to high-dimensional problems has received a lot of interest. Decomposition-based methods [13, 15, 17, 26, 31] assume that the high-dimensional function to be optimized has
a certain structure, typically the additive structure. By decomposing the original high-dimensional
function into the sum of several low-dimensional functions, they optimize each low-dimensional
function to obtain the point in the high-dimensional space. However, it is not easy to decide whether a decomposition exists, nor to learn the decomposition.
Other methods often assume that the original high-dimensional function with dimension D has a low-dimensional subspace with dimension d ≪ D, and then perform the optimization in the low-dimensional subspace and project the low-dimensional point back for evaluation. For example, embedding-based methods [20, 27, 42] use a random matrix to embed the original space into the low-dimensional subspace. Another way is to select a subset of variables directly, which can even avoid the time-consuming matrix operations of embedding-based methods. For example, Dropout [21] selects d variables randomly in each iteration. Note that for both embedding and variable selection methods, the parameter d can have a large influence on the performance, which is, however, difficult to set in real-world problems.
∗Equal Contribution
†Corresponding Author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
In this paper, we propose a new Variable Selection method using Monte Carlo Tree Search (MCTS),
called MCTS-VS. MCTS is employed to partition the variables into important and unimportant ones,
and only those selected important variables are optimized via any black-box optimization algorithm,
e.g., vanilla BO [32] or TuRBO [10]. The values of unimportant variables are sampled using historical
information. Compared with Dropout-BO, MCTS-VS can select important variables automatically.
We also provide regret and computational complexity analyses of general variable selection methods,
showing that variable selection can reduce the computational complexity while increasing the
cumulative regret. Our regret bound generalizes that of GP-UCB [38], which always selects all variables, as well as that of Dropout [21], which selects d variables randomly in each iteration. The results suggest that a good variable selection method should select variables that are as important as possible.
Experiments on high-dimensional synthetic functions and real-world problems (i.e., NAS and RL
problems) show that MCTS-VS is better than the previous variable selection method Dropout [21], and can also achieve performance competitive with state-of-the-art BO algorithms. Furthermore, its running time is small due to the advantage of variable selection. We also observe that MCTS-VS can
select important variables, explaining its good performance based on our theoretical analysis.
2 Background
2.1 Bayesian Optimization
We consider the problem max_{x∈𝒳} f(x), where f is a black-box function and 𝒳 ⊆ ℝ^D is the domain.
The basic framework of BO contains two critical components: a surrogate model and an acquisition
function. GP is the most popular surrogate model. Given the sampled data points {(x_i, y_i)}_{i=1}^{t-1}, where y_i = f(x_i) + ε_i and ε_i ~ 𝒩(0, η²) is the observation noise, GP at iteration t seeks to infer f ~ GP(µ(·), k(·,·) + η²I), specified by the mean µ(·) and covariance kernel k(·,·), where I is the identity matrix of size D. After that, an acquisition function, e.g., Probability of Improvement (PI) [19], Expected Improvement (EI) [16] or Upper Confidence Bound (UCB) [38], is used to determine the next query point x_t while balancing exploitation and exploration.
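As a concrete reference, the following is a minimal sketch of this loop in Python, using scikit-learn's GP regressor and a UCB acquisition maximized over random candidates; the objective f, the bounds array, and the coefficient beta are illustrative placeholders, not part of any specific method in this paper.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=5, n_iter=50, beta=2.0, n_cand=1000, rng=None):
    # bounds: array of shape (D, 2) giving [lower, upper] for each variable
    rng = np.random.default_rng(rng)
    D = bounds.shape[0]
    sample = lambda n: rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, D))
    X = sample(n_init)                      # initial design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)                        # fit the GP surrogate model
        cand = sample(n_cand)               # candidates for maximizing the acquisition
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + beta * sigma)]   # UCB acquisition
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))         # expensive black-box evaluation
    return X[np.argmax(y)], y.max()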
2.2 High-dimensional Bayesian Optimization
Scaling BO to high-dimensional problems is a challenge due to the curse of dimensionality and the
computation cost. As the dimension increases, the search space increases exponentially, requiring
more samples, and thus more expensive evaluations, to find a good solution. Furthermore, the
computation cost of updating the GP model and optimizing the acquisition function will be very
time-consuming [30]. There have been a few common approaches to tackle high-dimensional BO with different assumptions.
Decomposition.
Assuming that the function can be decomposed into the sum of low-dimensional functions with disjoint subspaces, Kandasamy et al. [17] proposed the Add-GP-UCB algorithm to optimize those low-dimensional functions separately, which was further generalized to overlapping subspaces [26, 31]. Wang et al. [43] proposed ensemble BO, which uses an ensemble of additive GP models for scalability. Han et al. [13] constrained the dependency graphs of the decomposition to tree structures to facilitate decomposition learning and optimization. For most problems, however, the decomposition is unknown and also difficult to learn.
Embedding.
Assuming that only a few dimensions affect the high-dimensional function significantly,
embedding-based methods embed the high-dimensional space into a low-dimensional subspace, and
optimize in the subspace while projecting the point back for evaluation. REMBO and its variants use
a random matrix to embed the search space into a low-dimensional subspace [3, 4, 42]. Nayebi et al. [27] used a hash-based method for embedding. Letham et al. [20] proposed ALEBO, which addresses several misconceptions in REMBO to improve the performance. VAE-based approaches have also been employed to project a structured input space (e.g., graphs and images) into a low-dimensional subspace [12, 22].
Variable Selection.
Based on the same assumption as embedding, variable selection methods
iteratively select a subset of variables to build a low-dimensional subspace and optimize through
BO. The selected variables can be viewed as important variables that are valuable for exploitation, or as variables with high uncertainty that are valuable for exploration. A classical method is Dropout [21], which randomly chooses d variables in each iteration. Spagnol et al. [37] used the Hilbert-Schmidt independence criterion to guide variable selection. When evaluating the sampled point, the values of the unselected variables are obtained by random sampling or by using historical information. VS-BO [33] selects variables with larger estimated gradients and uses CMA-ES [14] to obtain the values of unselected variables. Note that variable selection can be faster than embedding, because the embedding cost (e.g., matrix inversion) is time-consuming for high-dimensional optimization.
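For illustration, one iteration of such a Dropout-style step might look as follows. This is a sketch rather than Dropout's exact implementation: optimize_subset stands in for an arbitrary BO routine over the selected coordinates, and the fill-in here simply copies the best point found so far.

import numpy as np

def dropout_step(f, bounds, X, y, d, optimize_subset, rng=None):
    # X, y: points evaluated so far; optimize_subset: any BO routine that proposes
    # values for the selected variables only (hypothetical helper)
    rng = np.random.default_rng(rng)
    D = bounds.shape[0]
    M = rng.choice(D, size=d, replace=False)          # randomly selected variables
    x_M = optimize_subset(X[:, M], y, bounds[M])      # optimize the d selected variables
    x_new = X[np.argmax(y)].copy()                    # fill in unselected variables
    x_new[M] = x_M                                    # from the best point so far
    y_new = f(x_new)
    return np.vstack([X, x_new]), np.append(y, y_new)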
Both embedding and variable selection methods need to specify the parameter d, i.e., the dimension of the low-dimensional subspace, which affects the performance significantly but is not easy to set.
There are also some methods to improve the basic components of BO directly for high-dimensional
problems. For example, DNGO [36] uses a neural network as an alternative to GP to speed up inference; BO-PP [29] generates pseudo-points (i.e., data points whose objective values are not evaluated) to improve the GP model; SAASBO [9] uses a sparsity-inducing prior to perform variable selection implicitly, pushing the coefficients of unimportant variables towards zero and thus restraining over-exploration on these variables. Note that, different from Dropout and our proposed MCTS-VS, SAASBO still optimizes all variables, and due to its high computational cost of inference, it is very time-consuming, as reported in [9]. These methods can be combined with the above-mentioned dimensionality reduction methods, which may bring further improvement.
2.3 Monte Carlo Tree Search
MCTS [5] is a tree search algorithm based on random sampling, and has shown great success in high-dimensional tasks, such as Go [34, 35]. A tree node represents a state, describing the current situation, e.g., the position in path planning. Each tree node X stores a value v_X representing its goodness, and the number n_X of times it has been visited. They are used to calculate UCB [1], i.e.,

v_X + 2C_p · √(2(log n_p)/n_X),    (1)

where C_p is a hyper-parameter, and n_p is the number of visits of the parent of X. UCB considers both exploitation and exploration, and will be used for node selection.
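In code, Eq. (1) is a one-liner; treating unvisited nodes as having infinite UCB is a common convention added here, not prescribed by Eq. (1).

import math

def ucb(v_x, n_x, n_parent, c_p):
    # Eq. (1): node value plus an exploration bonus that shrinks with the visit count
    if n_x == 0:
        return float("inf")          # unvisited nodes are explored first (convention)
    return v_x + 2.0 * c_p * math.sqrt(2.0 * math.log(n_parent) / n_x)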
MCTS iteratively selects a leaf node of the tree for expansion. Each iteration can be divided into four steps: selection, expansion, simulation and back-propagation. Starting from the root node, selection recursively picks the child with the larger UCB until a leaf node, denoted as X, is reached. Expansion executes a certain action in the state represented by X and transfers to the next state, e.g., moving forward and arriving at a new position in path planning; we use the child node Y of X to represent the next state. Simulation obtains the value v_Y via random sampling. Back-propagation updates the value and the number of visits of Y's ancestors.
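A generic skeleton of these four steps might look as follows. This is schematic Python: the expand and simulate callables and the running-average value update are illustrative conventions, not part of any specific algorithm in this paper.

import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.v, self.n = [], 0.0, 0

def mcts_iteration(root, expand, simulate, c_p=1.0):
    node = root
    while node.children:                         # selection: follow the largest UCB (Eq. (1))
        node = max(node.children,
                   key=lambda c: float("inf") if c.n == 0
                   else c.v + 2 * c_p * math.sqrt(2 * math.log(max(node.n, 1)) / c.n))
    child = expand(node)                         # expansion: apply an action, returning a child Node
    node.children.append(child)
    value = simulate(child.state)                # simulation: estimate the value by random sampling
    while child is not None:                     # back-propagation: update ancestors
        child.n += 1
        child.v += (value - child.v) / child.n   # running average (one common convention)
        child = child.parent
    return root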
To tackle high-dimensional optimization, Wang et al. [40] proposed LA-MCTS, which applies MCTS to iteratively partition the search space into small sub-regions, and optimizes only in the good sub-regions. That is, the root of the tree represents the entire search space Ω, and each tree node X represents a sub-region Ω_X. The value v_X is measured by the average objective value of the sampled points in the sub-region Ω_X. In each iteration, after selecting a leaf node X, LA-MCTS performs the optimization in Ω_X by vanilla BO [32] or TuRBO [10], and the sampled points are used for clustering and classification to bifurcate Ω_X into two disjoint sub-regions, which are "good" and "bad", respectively. Note that the sub-regions are generated by dividing the range of variables, so their dimensionality does not decrease, i.e., it is still the number of all variables. Wang et al. [40] have empirically shown the good performance of LA-MCTS. However, as the dimension increases, the search space increases exponentially, and more partitions and evaluations are required to find a good solution, making the application of LA-MCTS to high-dimensional optimization still limited.
3 MCTS-VS Method
In this section, we propose a Variable Selection method based on MCTS for high-dimensional
BO, briefly called MCTS-VS. The main idea is to apply MCTS to iteratively partition all variables
into important and unimportant ones, and perform BO only for those important variables. Let [D] = {1, 2, . . . , D} denote the indexes of all variables x, and x_M denote the subset of variables indexed by M ⊆ [D].
We first introduce a D-dimensional vector named variable score, which is a key component of MCTS-VS. Its i-th element represents the importance of the i-th variable x_i. During the running
process of MCTS-VS, after optimizing a subset x_M of variables, where M ⊆ [D] denotes the indexes of the variables, a set 𝒟 of sampled points will be generated, and the pair (M, 𝒟) will be recorded into a set 𝔻, called the information set. The variable score vector is based on 𝔻, and calculated as

s = ( Σ_{(M,𝒟)∈𝔻} Σ_{(x_i,y_i)∈𝒟} y_i · g(M) ) / ( Σ_{(M,𝒟)∈𝔻} |𝒟| · g(M) ),    (2)
where the function g: 2^[D] → {0,1}^D gives the Boolean vector representation of a variable index subset M ⊆ [D] (i.e., the i-th element of g(M) is 1 if i ∈ M, and 0 otherwise), and / is the element-wise division. Each dimension of Σ_{(M,𝒟)∈𝔻} Σ_{(x_i,y_i)∈𝒟} y_i · g(M) is the sum of query evaluations using each variable, and each dimension of Σ_{(M,𝒟)∈𝔻} |𝒟| · g(M) is the number of queries using each variable. Thus, the i-th element of the variable score s, representing the importance of the i-th variable x_i, is actually measured by the average goodness of all the sampled points that are generated by optimizing a subset of variables containing x_i. The variable score s will be used to define the value of each tree node of MCTS as well as for node expansion.
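As an illustration, Eq. (2) can be computed as follows. This is a sketch assuming maximization, with the information set represented as a list of (M, points) pairs; guarding the division is an implementation choice for variables that have never been selected.

import numpy as np

def variable_score(info_set, D):
    # info_set: list of (M, points) pairs, where M is a list of variable indexes and
    # points is a list of (x, y) evaluations obtained while optimizing x_M
    num = np.zeros(D)      # per-variable sum of observed values, Eq. (2) numerator
    den = np.zeros(D)      # per-variable number of queries, Eq. (2) denominator
    for M, points in info_set:
        g = np.zeros(D)
        g[list(M)] = 1.0                    # Boolean vector representation g(M)
        num += sum(y for _, y in points) * g
        den += len(points) * g
    return num / np.maximum(den, 1)         # element-wise division (never-selected variables stay 0)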
In MCTS-VS, the root of the tree represents all variables. A tree node X represents a subset of variables, whose index set is denoted by A_X ⊆ [D], and it stores the value v_X and the number n_X of visits, which are used to calculate the value of UCB as in Eq. (1). The value v_X is defined as the average score (i.e., importance) of the variables contained by X, which can be calculated by s · g(A_X)/|A_X|, where g(A_X) is the Boolean vector representation of A_X and |A_X| is the size of A_X, i.e., the number of variables in node X.
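In code, the node value is simply the mean score over the node's variable indexes, e.g.:

import numpy as np

def node_value(s, A_X):
    # average importance of the variables indexed by A_X, i.e., s · g(A_X) / |A_X|
    return float(np.mean(s[list(A_X)]))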
At each iteration, MCTS-VS first recursively selects the node with the larger UCB until a leaf node (denoted as X) is reached, which is regarded as containing important variables. Note that if we optimize the subset x_{A_X} of variables represented by the leaf X directly, the variables in x_{A_X} will have the same score (because they are optimized together), and their relative importance cannot be further distinguished. Thus, MCTS-VS uniformly selects a variable index subset M from A_X at random, and employs BO to optimize x_M as well as x_{A_X\M}; this process is repeated several times. After that, the information set 𝔻 will be augmented by the pairs of the selected variable index subset M (or A_X\M) and the corresponding sampled points generated by BO. The variable score vector s will be updated using this new 𝔻. Based on s, the variable index set A_X represented by the leaf X will be divided into two disjoint subsets, containing variables with larger and smaller scores (i.e., important and unimportant variables), respectively, and the leaf X will be bifurcated into two child nodes accordingly. Finally, the v values of these two children will be calculated using the variable score vector s, and back-propagation will be performed to update the v value and the number of visits of the nodes along the current path of the tree.
MCTS-VS can be equipped with any specific BO optimizer, resulting in the concrete algorithm
MCTS-VS-BO, where BO is used to optimize the selected subsets of variables during the running
of MCTS-VS. Compared with LA-MCTS [40], MCTS-VS applies MCTS to partition the variables instead of the search space, and thus can be more scalable. Compared with the previous variable selection method Dropout [21], MCTS-VS can select important variables automatically instead of
randomly selecting a fixed number of variables in each iteration. Next we introduce it in detail.
3.1 Details of MCTS-VS
The procedure of MCTS-VS is described in Algorithm 1. In line 1, it first initializes the information set 𝔻. In particular, a variable index subset M_i is randomly sampled from [D], and Latin hypercube sampling [24] is used to generate two sets (denoted as 𝒟_i and 𝒟̄_i) of N_s points each, forming the two pairs (M_i, 𝒟_i) and (M̄_i, 𝒟̄_i), where M̄_i = [D]\M_i. This process is repeated N_v times, resulting in the initial 𝔻 = {(M_i, 𝒟_i), (M̄_i, 𝒟̄_i)}_{i=1}^{N_v}. The variable score vector s is calculated using this initial 𝔻 in line 3, and the Monte Carlo tree is initialized in line 4 by adding only a root node, whose v value is calculated according to s and whose number of visits is 0. MCTS-VS uses the variable t to record the number of evaluations it has performed, and thus t is set to 2 × N_v × N_s in line 5, as the initial 𝔻 contains 2 × N_v × N_s sampled points in total.
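A sketch of this initialization (line 1 of Algorithm 1) using scipy's Latin hypercube sampler is given below; splitting [D] into two halves of equal size and sampling full-dimensional points for each pair are illustrative choices, since the text does not pin these details down here.

import numpy as np
from scipy.stats import qmc

def init_info_set(f, bounds, N_v, N_s, rng=None):
    # For each of N_v rounds, split [D] at random into (M_i, complement), draw N_s
    # Latin-hypercube points over the full space for each half, evaluate them, and
    # record the two (index set, points) pairs. The half/half split is an assumption.
    rng = np.random.default_rng(rng)
    D = bounds.shape[0]
    info_set = []
    for _ in range(N_v):
        M = rng.choice(D, size=D // 2, replace=False)
        M_bar = np.setdiff1d(np.arange(D), M)
        for idx in (M, M_bar):
            unit = qmc.LatinHypercube(d=D, seed=rng).random(n=N_s)
            pts = qmc.scale(unit, bounds[:, 0], bounds[:, 1])
            info_set.append((idx, [(x, f(x)) for x in pts]))
    return info_set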
In each iteration (i.e., lines 7–28) of MCTS-VS, it selects a leaf node X by UCB in line 10, and optimizes the variables (i.e., x_{A_X}) represented by X in lines 13–23. Note that, to measure the relative importance of the variables in x_{A_X}, MCTS-VS optimizes different subsets of variables of x_{A_X}
Algorithm 1 MCTS-VS
Parameters: batch size N_v of variable index subsets, sample batch size N_s, total number N_e of evaluations, threshold N_bad for re-initializing a tree and N_split for splitting a node, hyper-parameter k for the best-k strategy
Process:
1: Initialize the information set 𝔻 = {(M_i, 𝒟_i), (M̄_i, 𝒟̄_i)}_{i=1}^{N_v};
2: Store the best k sampled points in 𝔻;
3: Calculate the variable score s using 𝔻 as in Eq. (2);
4: Initialize the Monte Carlo tree;
5: Set t = 2 × N_v × N_s and n_bad = 0;
6: while t < N_e do
7:   if n_bad > N_bad then
8:     Initialize the Monte Carlo tree and set n_bad = 0
9:   end if
10:  X ← the leaf node selected by UCB;
11:  Let A_X denote the indexes of the subset of variables represented by X;
12:  Increase n_bad by 1 once visiting a right child node on the path from the root node to X;
13:  for j = 1 : N_v do
14:    Sample a variable index subset M from A_X uniformly at random;
15:    Fit a GP model using the points {(x^i_M, y^i)}_{i=1}^t sampled-so-far, where only the variables indexed by M are used;
16:    Generate {x^{t+i}_M}_{i=1}^{N_s} by maximizing an acquisition function;
17:    Determine {x^{t+i}_{[D]\M}}_{i=1}^{N_s} by the "fill-in" strategy;
18:    Evaluate x^{t+i} = [x^{t+i}_M, x^{t+i}_{[D]\M}] to obtain y^{t+i} for i = 1, 2, . . . , N_s;
19:    𝔻 = 𝔻 ∪ {(M, {(x^{t+i}, y^{t+i})}_{i=1}^{N_s})};
20:    Store the best k points sampled-so-far;
21:    t = t + N_s;
22:    Repeat lines 15–21 for M̄ = A_X\M
23:  end for
24:  Calculate the variable score s using 𝔻 as in Eq. (2);
25:  if |A_X| > N_split then
26:    Bifurcate the leaf node X into two child nodes, whose v value and number of visits are calculated by s and set to 0, respectively
27:  end if
28:  Back-propagate to update the v value and number of visits of the nodes on the path from the root to X
29: end while
instead of x_{A_X} directly. That is, a variable index subset M is randomly sampled from A_X in line 14, and the corresponding subset x_M of variables is optimized by BO in lines 15–16. The data points {(x^i_M, y^i)}_{i=1}^t sampled-so-far are used to fit a GP model, and N_s (called the sample batch size) new points {x^{t+i}_M}_{i=1}^{N_s} are generated by maximizing an acquisition function. Note that this is a standard BO procedure, which can be replaced by any other variant. To evaluate x^{t+i}_M, we need to fill in the values of the other variables x^{t+i}_{[D]\M}, which will be explained later. After evaluating x^{t+i} = [x^{t+i}_M, x^{t+i}_{[D]\M}] in line 18, the information set 𝔻 is augmented with the new pair (M, {(x^{t+i}, y^{t+i})}_{i=1}^{N_s}) in line 19, and t is increased by N_s accordingly in line 21. For fairness, the complement subset x_{M̄} of variables, where M̄ = A_X\M, is optimized in the same way, i.e., lines 15–21 of Algorithm 1 are repeated for M̄. The whole process of optimizing x_M and x_{M̄} in lines 14–22 is repeated N_v times, where N_v is called the batch size of variable index subsets.
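The inner loop (lines 14–22) can be sketched as follows; propose_by_bo and fill_in are hypothetical stand-ins for the GP/acquisition step of lines 15–16 and the fill-in strategy of line 17, and taking |M| = |A_X|/2 is only an illustrative choice.

import numpy as np

def optimize_leaf(f, A_X, X, y, info_set, N_s, propose_by_bo, fill_in, rng=None):
    # X, y: all points evaluated so far; propose_by_bo fits a GP on the columns in idx
    # and returns N_s proposals for those coordinates; fill_in completes the rest.
    rng = np.random.default_rng(rng)
    A_X = np.asarray(A_X)
    M = rng.choice(A_X, size=max(1, len(A_X) // 2), replace=False)    # line 14
    for idx in (M, np.setdiff1d(A_X, M)):                             # lines 15-21, then line 22
        proposals = propose_by_bo(X[:, idx], y, idx, N_s)             # lines 15-16
        new_points = []
        for x_idx in proposals:
            x_full = fill_in(X, y, idx)                               # line 17, e.g., best-k
            x_full[idx] = x_idx
            new_points.append((x_full, f(x_full)))                    # line 18
        info_set.append((idx, new_points))                            # line 19
        X = np.vstack([X] + [p[0] for p in new_points])
        y = np.append(y, [p[1] for p in new_points])                  # lines 20-21 analogue
    return X, y, info_set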
To fill in the values of the un-optimized variables in line 17, we employ the best-k strategy, which utilizes the best k data points sampled-so-far, denoted as {(x^j, y^j)}_{j=1}^k. That is, {y^j}_{j=1}^k are the k largest objective values observed-so-far. If the i-th variable is un-optimized, its value will be uniformly selected from {x^j_i}_{j=1}^k at random. Thus, MCTS-VS needs to store the best k data points in line 2 after initializing the information set 𝔻, and update them in line 20 after augmenting 𝔻.
Other direct “fill-in” strategies include sampling the value randomly, or using the average variable
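As a concrete reference for the best-k strategy described above, here is a minimal sketch (the default k and the random-generator handling are illustrative):

import numpy as np

def best_k_fill_in(X, y, M, k=5, rng=None):
    # X, y: all points evaluated so far; M: indexes of the variables the BO step will set.
    # Returns a full point whose un-optimized coordinates are each drawn uniformly from
    # the corresponding coordinate of the k best points observed so far.
    rng = np.random.default_rng(rng)
    best = X[np.argsort(y)[-k:]]              # the k points with the largest objective values
    selected = set(np.atleast_1d(M).tolist())
    x = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        if i not in selected:
            x[i] = best[rng.integers(len(best)), i]
    return x                                  # coordinates in M are overwritten by the BO proposal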