Monte Carlo Tree Search based Variable Selection
for High Dimensional Bayesian Optimization
Lei Song∗, Ke Xue∗, Xiaobin Huang, Chao Qian†
State Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210023, China
{songl, xuek, huangxb, qianc}@lamda.nju.edu.cn
Abstract
Bayesian optimization (BO) is a class of popular methods for expensive black-box
optimization, and has been widely applied to many scenarios. However, BO suffers
from the curse of dimensionality, and scaling it to high-dimensional problems is still
a challenge. In this paper, we propose a variable selection method MCTS-VS based
on Monte Carlo tree search (MCTS), to iteratively select and optimize a subset of
variables. That is, MCTS-VS constructs a low-dimensional subspace via MCTS and
optimizes in the subspace with any BO algorithm. We give a theoretical analysis of
the general variable selection method to reveal how it can work. Experiments on
high-dimensional synthetic functions and real-world problems (i.e., NAS-bench
problems and MuJoCo locomotion tasks) show that MCTS-VS equipped with a
proper BO optimizer can achieve state-of-the-art performance.
1 Introduction
In many real-world tasks such as neural architecture search (NAS) [41] and policy search in reinforcement learning (RL) [6], one often needs to solve expensive black-box optimization problems. Bayesian optimization (BO) [2, 11, 23, 32] is a sample-efficient algorithm for solving such problems.
It iteratively fits a surrogate model, typically a Gaussian process (GP), and maximizes an acquisition
function to obtain the next point to evaluate. While BO has been employed in a wide variety of
settings, successful applications are often limited to low-dimensional problems.
Recently, scaling BO to high-dimensional problems has received a lot of interest. Decomposition-based methods [13, 15, 17, 26, 31] assume that the high-dimensional function to be optimized has
a certain structure, typically the additive structure. By decomposing the original high-dimensional
function into the sum of several low-dimensional functions, they optimize each low-dimensional
function to obtain the point in the high-dimensional space. However, it is not easy to decide whether a decomposition exists, nor to learn the decomposition.
Other methods often assume that the original high-dimensional function with dimension D has a low-dimensional subspace with dimension d ≪ D, and then perform the optimization in the low-dimensional subspace and project the low-dimensional point back for evaluation. For example, embedding-based methods [20, 27, 42] use a random matrix to embed the original space into the low-dimensional subspace. Another way is to select a subset of variables directly, which can even avoid the time-consuming matrix operations of embedding-based methods. For example, Dropout [21] selects d variables randomly in each iteration. Note that for both embedding and variable selection methods, the parameter d can have a large influence on the performance, which is, however, difficult to set in real-world problems.
∗Equal Contribution
†Corresponding Author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
In this paper, we propose a new Variable Selection method using Monte Carlo Tree Search (MCTS),
called MCTS-VS. MCTS is employed to partition the variables into important and unimportant ones,
and only those selected important variables are optimized via any black-box optimization algorithm,
e.g., vanilla BO [32] or TuRBO [10]. The values of unimportant variables are sampled using historical
information. Compared with Dropout-BO, MCTS-VS can select important variables automatically.
We also provide regret and computational complexity analyses of general variable selection methods,
showing that variable selection can reduce the computational complexity while increasing the
cumulative regret. Our regret bound generalizes that of GP-UCB [38], which always selects all variables, as well as that of Dropout [21], which selects d variables randomly in each iteration. The results suggest that a good variable selection method should select variables that are as important as possible.
Experiments on high-dimensional synthetic functions and real-world problems (i.e., NAS and RL
problems) show that MCTS-VS is better than the previous variable selection method Dropout [21], and can also achieve performance competitive with state-of-the-art BO algorithms. Furthermore, its running time is small due to the advantage of variable selection. We also observe that MCTS-VS can
select important variables, explaining its good performance based on our theoretical analysis.
2 Background
2.1 Bayesian Optimization
We consider the problem max_{x∈𝒳} f(x), where f is a black-box function and 𝒳 ⊆ ℝ^D is the domain.
The basic framework of BO contains two critical components: a surrogate model and an acquisition
function. GP is the most popular surrogate model. Given the sampled data points {(x_i, y_i)}_{i=1}^{t-1}, where y_i = f(x_i) + ε_i and ε_i ~ 𝒩(0, η²) is the observation noise, GP at iteration t seeks to infer f ~ GP(µ(·), k(·,·) + η²I), specified by the mean µ(·) and covariance kernel k(·,·), where I is the identity matrix of size D. After that, an acquisition function, e.g., Probability of Improvement (PI) [19], Expected Improvement (EI) [16] or Upper Confidence Bound (UCB) [38], is used to determine the next query point x_t while balancing exploitation and exploration.
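As a concrete reference, the following is a minimal sketch of this loop in Python, using scikit-learn's GP regressor and a UCB acquisition maximized over random candidates; the objective f, the bounds array, and the coefficient beta are illustrative placeholders, not part of any specific method in this paper.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=5, n_iter=50, beta=2.0, n_cand=1000, rng=None):
    # bounds: array of shape (D, 2) giving [lower, upper] for each variable
    rng = np.random.default_rng(rng)
    D = bounds.shape[0]
    sample = lambda n: rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, D))
    X = sample(n_init)                      # initial design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)                        # fit the GP surrogate model
        cand = sample(n_cand)               # candidates for maximizing the acquisition
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + beta * sigma)]   # UCB acquisition
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))         # expensive black-box evaluation
    return X[np.argmax(y)], y.max()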
2.2 High-dimensional Bayesian Optimization
Scaling BO to high-dimensional problems is a challenge due to the curse of dimensionality and the
computation cost. As the dimension increases, the search space increases exponentially, requiring
more samples, and thus more expensive evaluations, to find a good solution. Furthermore, the
computation cost of updating the GP model and optimizing the acquisition function will be very
time-consuming [30]. There have been a few common approaches to tackle high-dimensional BO with different assumptions.
Decomposition.
Assuming that the function can be decomposed into the sum of low-dimensional functions with disjoint subspaces, Kandasamy et al. [17] proposed the Add-GP-UCB algorithm to optimize those low-dimensional functions separately, which was further generalized to overlapping subspaces [26, 31]. Wang et al. [43] proposed ensemble BO, which uses an ensemble of additive GP models for scalability. Han et al. [13] constrained the dependency graphs of the decomposition to tree structures to facilitate decomposition learning and optimization. For most problems, however, the decomposition is unknown and also difficult to learn.
Embedding.
Assuming that only a few dimensions affect the high-dimensional function significantly,
embedding-based methods embed the high-dimensional space into a low-dimensional subspace, and
optimize in the subspace while projecting the point back for evaluation. REMBO and its variants use
a random matrix to embed the search space into a low-dimensional subspace [3, 4, 42]. Nayebi et al. [27] used a hash-based method for embedding. Letham et al. [20] proposed ALEBO, which addresses several misconceptions in REMBO to improve the performance. VAE-based approaches have also been employed to project a structured input space (e.g., graphs and images) into a low-dimensional subspace [12, 22].
Variable Selection.
Based on the same assumption as embedding, variable selection methods
iteratively select a subset of variables to build a low-dimensional subspace and optimize through
BO. The selected variables can be viewed as important variables that are valuable for exploitation, or as variables with high uncertainty that are valuable for exploration. A classical method is Dropout [21], which randomly chooses d variables in each iteration. Spagnol et al. [37] used the Hilbert-Schmidt independence criterion to guide variable selection. When evaluating the sampled point, the values of the unselected variables are obtained by random sampling or by using historical information. VS-BO [33] selects variables with larger estimated gradients and uses CMA-ES [14] to obtain the values of unselected variables. Note that variable selection can be faster than embedding, because the embedding cost (e.g., matrix inversion) is time-consuming for high-dimensional optimization.
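For illustration, one iteration of such a Dropout-style step might look as follows. This is a sketch rather than Dropout's exact implementation: optimize_subset stands in for an arbitrary BO routine over the selected coordinates, and the fill-in here simply copies the best point found so far.

import numpy as np

def dropout_step(f, bounds, X, y, d, optimize_subset, rng=None):
    # X, y: points evaluated so far; optimize_subset: any BO routine that proposes
    # values for the selected variables only (hypothetical helper)
    rng = np.random.default_rng(rng)
    D = bounds.shape[0]
    M = rng.choice(D, size=d, replace=False)          # randomly selected variables
    x_M = optimize_subset(X[:, M], y, bounds[M])      # optimize the d selected variables
    x_new = X[np.argmax(y)].copy()                    # fill in unselected variables
    x_new[M] = x_M                                    # from the best point so far
    y_new = f(x_new)
    return np.vstack([X, x_new]), np.append(y, y_new)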
Both embedding and variable selection methods need to specify the parameter d, i.e., the dimension of the low-dimensional subspace, which affects the performance significantly but is not easy to set.
There are also some methods to improve the basic components of BO directly for high-dimensional
problems. For example, DNGO [36] uses a neural network as an alternative to GP to speed up inference; BO-PP [29] generates pseudo-points (i.e., data points whose objective values are not evaluated) to improve the GP model; SAASBO [9] uses a sparsity-inducing prior to perform variable selection implicitly, pushing the coefficients of unimportant variables towards zero and thus restraining over-exploration on these variables. Note that, different from Dropout and our proposed MCTS-VS, SAASBO still optimizes all variables, and due to its high computational cost of inference, it is very time-consuming, as reported in [9]. These methods can be combined with the above-mentioned dimensionality reduction methods, which may bring further improvement.
2.3 Monte Carlo Tree Search
MCTS [5] is a tree search algorithm based on random sampling, and has shown great success in high-dimensional tasks, such as Go [34, 35]. A tree node represents a state, describing the current situation, e.g., the position in path planning. Each tree node X stores a value v_X representing its goodness, and the number n_X of times it has been visited. They are used to calculate UCB [1], i.e.,

v_X + 2C_p · √(2(log n_p)/n_X),    (1)

where C_p is a hyper-parameter, and n_p is the number of visits of the parent of X. UCB considers both exploitation and exploration, and will be used for node selection.
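In code, Eq. (1) is a one-liner; treating unvisited nodes as having infinite UCB is a common convention added here, not prescribed by Eq. (1).

import math

def ucb(v_x, n_x, n_parent, c_p):
    # Eq. (1): node value plus an exploration bonus that shrinks with the visit count
    if n_x == 0:
        return float("inf")          # unvisited nodes are explored first (convention)
    return v_x + 2.0 * c_p * math.sqrt(2.0 * math.log(n_parent) / n_x)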
MCTS iteratively selects a leaf node of the tree for expansion. Each iteration can be divided into four steps: selection, expansion, simulation and back-propagation. Starting from the root node, selection recursively picks the child with the larger UCB until a leaf node, denoted as X, is reached. Expansion executes a certain action in the state represented by X and transfers to the next state, e.g., moving forward and arriving at a new position in path planning; we use the child node Y of X to represent the next state. Simulation obtains the value v_Y via random sampling. Back-propagation updates the value and the number of visits of Y's ancestors.
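A generic skeleton of these four steps might look as follows. This is schematic Python: the expand and simulate callables and the running-average value update are illustrative conventions, not part of any specific algorithm in this paper.

import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.v, self.n = [], 0.0, 0

def mcts_iteration(root, expand, simulate, c_p=1.0):
    node = root
    while node.children:                         # selection: follow the largest UCB (Eq. (1))
        node = max(node.children,
                   key=lambda c: float("inf") if c.n == 0
                   else c.v + 2 * c_p * math.sqrt(2 * math.log(max(node.n, 1)) / c.n))
    child = expand(node)                         # expansion: apply an action, returning a child Node
    node.children.append(child)
    value = simulate(child.state)                # simulation: estimate the value by random sampling
    while child is not None:                     # back-propagation: update ancestors
        child.n += 1
        child.v += (value - child.v) / child.n   # running average (one common convention)
        child = child.parent
    return root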
To tackle high-dimensional optimization, Wang et al. [40] proposed LA-MCTS, which applies MCTS to iteratively partition the search space into small sub-regions, and optimizes only in the good sub-regions. That is, the root of the tree represents the entire search space Ω, and each tree node X represents a sub-region Ω_X. The value v_X is measured by the average objective value of the sampled points in the sub-region Ω_X. In each iteration, after selecting a leaf node X, LA-MCTS performs the optimization in Ω_X by vanilla BO [32] or TuRBO [10], and the sampled points are used for clustering and classification to bifurcate Ω_X into two disjoint sub-regions, which are "good" and "bad", respectively. Note that the sub-regions are generated by dividing the range of variables, so their dimensionality does not decrease, i.e., it is still the number of all variables. Wang et al. [40] have empirically shown the good performance of LA-MCTS. However, as the dimension increases, the search space increases exponentially, and more partitions and evaluations are required to find a good solution, making the application of LA-MCTS to high-dimensional optimization still limited.
3 MCTS-VS Method
In this section, we propose a Variable Selection method based on MCTS for high-dimensional
BO, briefly called MCTS-VS. The main idea is to apply MCTS to iteratively partition all variables
into important and unimportant ones, and perform BO only for those important variables. Let [D] = {1, 2, . . . , D} denote the indexes of all variables x, and x_M denote the subset of variables indexed by M ⊆ [D].
We first introduce a D-dimensional vector named variable score, which is a key component of MCTS-VS. Its i-th element represents the importance of the i-th variable x_i. During the running
process of MCTS-VS, after optimizing a subset x_M of variables, where M ⊆ [D] denotes the indexes of the variables, a set 𝒟 of sampled points will be generated, and the pair (M, 𝒟) will be recorded into a set 𝔻, called the information set. The variable score vector is based on 𝔻, and calculated as

s = ( Σ_{(M,𝒟)∈𝔻} Σ_{(x_i,y_i)∈𝒟} y_i · g(M) ) / ( Σ_{(M,𝒟)∈𝔻} |𝒟| · g(M) ),    (2)
where the function g: 2^[D] → {0,1}^D gives the Boolean vector representation of a variable index subset M ⊆ [D] (i.e., the i-th element of g(M) is 1 if i ∈ M, and 0 otherwise), and / is the element-wise division. Each dimension of Σ_{(M,𝒟)∈𝔻} Σ_{(x_i,y_i)∈𝒟} y_i · g(M) is the sum of query evaluations using each variable, and each dimension of Σ_{(M,𝒟)∈𝔻} |𝒟| · g(M) is the number of queries using each variable. Thus, the i-th element of the variable score s, representing the importance of the i-th variable x_i, is actually measured by the average goodness of all the sampled points that are generated by optimizing a subset of variables containing x_i. The variable score s will be used to define the value of each tree node of MCTS as well as for node expansion.
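As an illustration, Eq. (2) can be computed as follows. This is a sketch assuming maximization, with the information set represented as a list of (M, points) pairs; guarding the division is an implementation choice for variables that have never been selected.

import numpy as np

def variable_score(info_set, D):
    # info_set: list of (M, points) pairs, where M is a list of variable indexes and
    # points is a list of (x, y) evaluations obtained while optimizing x_M
    num = np.zeros(D)      # per-variable sum of observed values, Eq. (2) numerator
    den = np.zeros(D)      # per-variable number of queries, Eq. (2) denominator
    for M, points in info_set:
        g = np.zeros(D)
        g[list(M)] = 1.0                    # Boolean vector representation g(M)
        num += sum(y for _, y in points) * g
        den += len(points) * g
    return num / np.maximum(den, 1)         # element-wise division (never-selected variables stay 0)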
In MCTS-VS, the root of the tree represents all variables. A tree node X represents a subset of variables, whose index set is denoted by A_X ⊆ [D], and it stores the value v_X and the number n_X of visits, which are used to calculate the value of UCB as in Eq. (1). The value v_X is defined as the average score (i.e., importance) of the variables contained by X, which can be calculated by s · g(A_X)/|A_X|, where g(A_X) is the Boolean vector representation of A_X and |A_X| is the size of A_X, i.e., the number of variables in node X.
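In code, the node value is simply the mean score over the node's variable indexes, e.g.:

import numpy as np

def node_value(s, A_X):
    # average importance of the variables indexed by A_X, i.e., s · g(A_X) / |A_X|
    return float(np.mean(s[list(A_X)]))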
At each iteration, MCTS-VS first recursively selects the node with the larger UCB until a leaf node (denoted as X) is reached, which is regarded as containing important variables. Note that if we optimize the subset x_{A_X} of variables represented by the leaf X directly, the variables in x_{A_X} will have the same score (because they are optimized together), and their relative importance cannot be further distinguished. Thus, MCTS-VS uniformly selects a variable index subset M from A_X at random, and employs BO to optimize x_M as well as x_{A_X\M}; this process is repeated several times. After that, the information set 𝔻 will be augmented by the pairs of the selected variable index subset M (or A_X\M) and the corresponding sampled points generated by BO. The variable score vector s will be updated using this new 𝔻. Based on s, the variable index set A_X represented by the leaf X will be divided into two disjoint subsets, containing variables with larger and smaller scores (i.e., important and unimportant variables), respectively, and the leaf X will be bifurcated into two child nodes accordingly. Finally, the v values of these two children will be calculated using the variable score vector s, and back-propagation will be performed to update the v value and the number of visits of the nodes along the current path of the tree.
MCTS-VS can be equipped with any specific BO optimizer, resulting in the concrete algorithm
MCTS-VS-BO, where BO is used to optimize the selected subsets of variables during the running
of MCTS-VS. Compared with LA-MCTS [40], MCTS-VS applies MCTS to partition the variables instead of the search space, and thus can be more scalable. Compared with the previous variable selection method Dropout [21], MCTS-VS can select important variables automatically instead of
randomly selecting a fixed number of variables in each iteration. Next we introduce it in detail.
3.1 Details of MCTS-VS
The procedure of MCTS-VS is described in Algorithm 1. In line 1, it first initializes the information set 𝔻. In particular, a variable index subset M_i is randomly sampled from [D], and Latin hypercube sampling [24] is used to generate two sets (denoted as 𝒟_i and 𝒟̄_i) of N_s points each, forming the two pairs (M_i, 𝒟_i) and (M̄_i, 𝒟̄_i), where M̄_i = [D]\M_i. This process is repeated N_v times, resulting in the initial 𝔻 = {(M_i, 𝒟_i), (M̄_i, 𝒟̄_i)}_{i=1}^{N_v}. The variable score vector s is calculated using this initial 𝔻 in line 3, and the Monte Carlo tree is initialized in line 4 by adding only a root node, whose v value is calculated according to s and whose number of visits is 0. MCTS-VS uses the variable t to record the number of evaluations it has performed, and thus t is set to 2 × N_v × N_s in line 5, as the initial 𝔻 contains 2 × N_v × N_s sampled points in total.
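A sketch of this initialization (line 1 of Algorithm 1) using scipy's Latin hypercube sampler is given below; splitting [D] into two halves of equal size and sampling full-dimensional points for each pair are illustrative choices, since the text does not pin these details down here.

import numpy as np
from scipy.stats import qmc

def init_info_set(f, bounds, N_v, N_s, rng=None):
    # For each of N_v rounds, split [D] at random into (M_i, complement), draw N_s
    # Latin-hypercube points over the full space for each half, evaluate them, and
    # record the two (index set, points) pairs. The half/half split is an assumption.
    rng = np.random.default_rng(rng)
    D = bounds.shape[0]
    info_set = []
    for _ in range(N_v):
        M = rng.choice(D, size=D // 2, replace=False)
        M_bar = np.setdiff1d(np.arange(D), M)
        for idx in (M, M_bar):
            unit = qmc.LatinHypercube(d=D, seed=rng).random(n=N_s)
            pts = qmc.scale(unit, bounds[:, 0], bounds[:, 1])
            info_set.append((idx, [(x, f(x)) for x in pts]))
    return info_set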
In each iteration (i.e., lines 7–28) of MCTS-VS, it selects a leaf node X by UCB in line 10, and optimizes the variables (i.e., x_{A_X}) represented by X in lines 13–23. Note that, to measure the relative importance of the variables in x_{A_X}, MCTS-VS optimizes different subsets of variables of x_{A_X}
Algorithm 1 MCTS-VS
Parameters: batch size N_v of variable index subsets, sample batch size N_s, total number N_e of evaluations, threshold N_bad for re-initializing a tree and N_split for splitting a node, hyper-parameter k for the best-k strategy
Process:
1: Initialize the information set 𝔻 = {(M_i, 𝒟_i), (M̄_i, 𝒟̄_i)}_{i=1}^{N_v};
2: Store the best k sampled points in 𝔻;
3: Calculate the variable score s using 𝔻 as in Eq. (2);
4: Initialize the Monte Carlo tree;
5: Set t = 2 × N_v × N_s and n_bad = 0;
6: while t < N_e do
7:   if n_bad > N_bad then
8:     Initialize the Monte Carlo tree and set n_bad = 0
9:   end if
10:  X ← the leaf node selected by UCB;
11:  Let A_X denote the indexes of the subset of variables represented by X;
12:  Increase n_bad by 1 once visiting a right child node on the path from the root node to X;
13:  for j = 1 : N_v do
14:    Sample a variable index subset M from A_X uniformly at random;
15:    Fit a GP model using the points {(x^i_M, y^i)}_{i=1}^t sampled-so-far, where only the variables indexed by M are used;
16:    Generate {x^{t+i}_M}_{i=1}^{N_s} by maximizing an acquisition function;
17:    Determine {x^{t+i}_{[D]\M}}_{i=1}^{N_s} by the "fill-in" strategy;
18:    Evaluate x^{t+i} = [x^{t+i}_M, x^{t+i}_{[D]\M}] to obtain y^{t+i} for i = 1, 2, . . . , N_s;
19:    𝔻 = 𝔻 ∪ {(M, {(x^{t+i}, y^{t+i})}_{i=1}^{N_s})};
20:    Store the best k points sampled-so-far;
21:    t = t + N_s;
22:    Repeat lines 15–21 for M̄ = A_X\M
23:  end for
24:  Calculate the variable score s using 𝔻 as in Eq. (2);
25:  if |A_X| > N_split then
26:    Bifurcate the leaf node X into two child nodes, whose v value and number of visits are calculated by s and set to 0, respectively
27:  end if
28:  Back-propagate to update the v value and number of visits of the nodes on the path from the root to X
29: end while
instead of x_{A_X} directly. That is, a variable index subset M is randomly sampled from A_X in line 14, and the corresponding subset x_M of variables is optimized by BO in lines 15–16. The data points {(x^i_M, y^i)}_{i=1}^t sampled-so-far are used to fit a GP model, and N_s (called the sample batch size) new points {x^{t+i}_M}_{i=1}^{N_s} are generated by maximizing an acquisition function. Note that this is a standard BO procedure, which can be replaced by any other variant. To evaluate x^{t+i}_M, we need to fill in the values of the other variables x^{t+i}_{[D]\M}, which will be explained later. After evaluating x^{t+i} = [x^{t+i}_M, x^{t+i}_{[D]\M}] in line 18, the information set 𝔻 is augmented with the new pair (M, {(x^{t+i}, y^{t+i})}_{i=1}^{N_s}) in line 19, and t is increased by N_s accordingly in line 21. For fairness, the complement subset x_{M̄} of variables, where M̄ = A_X\M, is optimized in the same way, i.e., lines 15–21 of Algorithm 1 are repeated for M̄. The whole process of optimizing x_M and x_{M̄} in lines 14–22 is repeated N_v times, where N_v is called the batch size of variable index subsets.
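The inner loop (lines 14–22) can be sketched as follows; propose_by_bo and fill_in are hypothetical stand-ins for the GP/acquisition step of lines 15–16 and the fill-in strategy of line 17, and taking |M| = |A_X|/2 is only an illustrative choice.

import numpy as np

def optimize_leaf(f, A_X, X, y, info_set, N_s, propose_by_bo, fill_in, rng=None):
    # X, y: all points evaluated so far; propose_by_bo fits a GP on the columns in idx
    # and returns N_s proposals for those coordinates; fill_in completes the rest.
    rng = np.random.default_rng(rng)
    A_X = np.asarray(A_X)
    M = rng.choice(A_X, size=max(1, len(A_X) // 2), replace=False)    # line 14
    for idx in (M, np.setdiff1d(A_X, M)):                             # lines 15-21, then line 22
        proposals = propose_by_bo(X[:, idx], y, idx, N_s)             # lines 15-16
        new_points = []
        for x_idx in proposals:
            x_full = fill_in(X, y, idx)                               # line 17, e.g., best-k
            x_full[idx] = x_idx
            new_points.append((x_full, f(x_full)))                    # line 18
        info_set.append((idx, new_points))                            # line 19
        X = np.vstack([X] + [p[0] for p in new_points])
        y = np.append(y, [p[1] for p in new_points])                  # lines 20-21 analogue
    return X, y, info_set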
To fill in the values of the un-optimized variables in line 17, we employ the best-k strategy, which utilizes the best k data points sampled-so-far, denoted as {(x^j, y^j)}_{j=1}^k. That is, {y^j}_{j=1}^k are the k largest objective values observed-so-far. If the i-th variable is un-optimized, its value will be uniformly selected from {x^j_i}_{j=1}^k at random. Thus, MCTS-VS needs to store the best k data points in line 2 after initializing the information set 𝔻, and update them in line 20 after augmenting 𝔻.
Other direct “fill-in” strategies include sampling the value randomly, or using the average variable
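As a concrete reference for the best-k strategy described above, here is a minimal sketch (the default k and the random-generator handling are illustrative):

import numpy as np

def best_k_fill_in(X, y, M, k=5, rng=None):
    # X, y: all points evaluated so far; M: indexes of the variables the BO step will set.
    # Returns a full point whose un-optimized coordinates are each drawn uniformly from
    # the corresponding coordinate of the k best points observed so far.
    rng = np.random.default_rng(rng)
    best = X[np.argsort(y)[-k:]]              # the k points with the largest objective values
    selected = set(np.atleast_1d(M).tolist())
    x = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        if i not in selected:
            x[i] = best[rng.integers(len(best)), i]
    return x                                  # coordinates in M are overwritten by the BO proposal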