Spending Thinking Time Wisely: Accelerating MCTS
with Virtual Expansions
Weirui Ye, Pieter Abbeel, Yang Gao
Tsinghua University, UC Berkeley, Shanghai Qi Zhi Institute
Abstract
One of the most important AI research questions is how to trade off computation versus performance, since "perfect rationality" exists in theory but is impossible to achieve in practice. Recently, Monte-Carlo tree search (MCTS) has attracted considerable attention due to its significant performance improvements in various challenging domains. However, the expensive time cost during search severely restricts its scope of application. This paper proposes Virtual MCTS (V-MCTS), a variant of MCTS that adaptively spends more search time on harder states and less search time on simpler states. We give theoretical bounds for the proposed method and evaluate its performance and computation on 9×9 Go board games and Atari games. Experiments show that our method can achieve performance comparable to the original search algorithm while requiring less than 50% of the search time on average. We believe that this approach is a viable alternative for tasks under limited time and resources. The code is available at https://github.com/YeWR/V-MCTS.git.
1 Introduction
When artificial intelligence was first studied in the 1950s, researchers sought to answer the question "How to build an agent with perfect rationality?". The term "perfect rationality" [7, 24, 26] here refers to decisions made with infinite amounts of computation. However, one can only solve small-scale problems without considering the practical computation time, since classical search algorithms usually exhibit exponential running time. Therefore, recent AI research no longer seeks "perfect rationality" but instead carefully trades off computation versus the level of rationality. Computational models such as "bounded optimality" have been developed to formalize these settings [26]. Increasing the level of rationality under the same computational budget has given us many AI successes; examples include Monte-Carlo sampling algorithms, variational inference algorithms, and the use of DNNs as universal function approximators [9, 8, 13, 30, 17].
Recently, MCTS-based RL algorithms have achieved much success, mainly on board games. The most notable achievement is that AlphaGo beat Fan Hui in 2015 [30], the first time a computer program beat a human professional Go player. Afterward, AlphaGo beat two top-ranking human players, Lee Sedol in 2016 and Ke Jie in 2017, the latter of whom was ranked first worldwide at the time. Later, MCTS-based RL algorithms were further extended to other board games and Atari games [27]. EfficientZero [34] significantly improves the sample efficiency of MCTS-based RL algorithms, shedding light on future applications in the real world such as robotics and self-driving.

Despite the impressive performance of MCTS-based RL algorithms, they require massive amounts of computation to train and evaluate. For example, MuZero [27] used 1000 TPUs for 12 hours to learn the game of Go, and a single Atari game needs 40 TPUs for 12 hours of training. Compared
to previous algorithms on the Atari games benchmark, it needs around two orders of magnitude more compute. This prohibitively large computational requirement has slowed down both the further development of MCTS-based RL algorithms and their practical use.
Under the hood, MCTS-based RL algorithms imagine possible futures by considering different future action sequences. However, this imagining process is not computationally efficient in current methods. For example, AlphaGo needs to look ahead through 1600 game states to place a single stone. By contrast, top human professional players can only think through around 100-200 game states per minute [30]. Apart from this inefficiency, the current MCTS algorithm deals with easy and challenging cases with the same computational budget, whereas humans know to spend their time when it is most needed.
In this paper, we aim to design new algorithms that save the computational time of MCTS-based RL methods. We make three key contributions: (1) We present Virtual MCTS, a variant of MCTS, to approximate the vanilla MCTS search policies with less computation. Unlike previous pruning-based methods that focus on the selection or evaluation stage of MCTS, our method improves the search loop itself, terminating the search iterations earlier and adaptively when the current state is simpler; (2) Theoretically, we provide error bounds for the proposed method. Furthermore, visualization results indicate that Virtual MCTS has a better computation-performance trade-off than vanilla MCTS; (3) Empirically, our method saves more than 50% of the search time on the challenging game of 9×9 Go and more than 60% on visually complex Atari games while maintaining performance comparable to vanilla MCTS.
2 Related Work
2.1 Reinforcement Learning with MCTS
For a long time, Computer Go was regarded as a remarkably challenging game [3, 6]. Researchers attempted to use Monte-Carlo techniques that evaluate the value of a node state through random playouts [4, 11, 12, 30]. Afterward, UCT algorithms were widely applied in Monte-Carlo tree search (MCTS), using UCB1 to select the action at each node of the tree [20]. Recently, MCTS-based RL methods [30, 32, 31, 27] have become increasingly popular and have achieved super-human performance on board games because of their strong search ability.
Modern MCTS-based RL algorithms include four stages in the search loop: selection, expansion, evaluation, and backpropagation. The computation bottlenecks in vanilla MCTS come from the search loop, especially the evaluation stage and the selection stage of each iteration. The selection stage is time-consuming when the search tree becomes wider and deeper. The evaluation stage is quite expensive because earlier research evaluated node values by running random playouts to the end. Due to the search loop, MCTS-based algorithms require many more model inferences than model-free RL methods such as PPO [28] and SAC [16].
2.2 Acceleration of MCTS
MCTS-based methods have proven their strong capability for solving complex games and tasks. However, the high computational cost of MCTS hinders its application to real-time and more general scenarios. Therefore, numerous works are devoted to accelerating MCTS. For example, to make the selection stage more effective, some heuristic pruning methods [14, 33, 29, 1, 2] aim to reduce the width and depth of the search tree with heuristic functions. Furthermore, for more efficient evaluations, Lorentz [22] proposed early playout termination of MCTS (MCTS-EPT), which stops the random playouts early and uses an evaluation function to assess win or loss. Hsueh et al. [18] applied MCTS-EPT to Chinese dark chess and demonstrated its effectiveness. Afterward, similar ideas were applied in the evaluation stage of AlphaGo Zero [32] and later MCTS-based methods [31, 27, 34]; they evaluate the Q-values through a learnable evaluation network instead of running playouts to the end. Grill et al. [15] propose a novel regularized policy optimization method based on AlphaZero that decreases the search budget of MCTS from an optimization perspective. Danihelka et al. [10] propose a policy improvement algorithm based on sampling actions without replacement, named the Gumbel trick, to achieve better performance when planning with few simulations.
However, the methods mentioned above focus on specific stages of the search iteration or reduce the total budget through pruning and optimization, which is orthogonal to our approach, and few works target the search loop itself. Lan et al. [21] propose DS-MCTS, which defines an uncertainty measure of MCTS and approximates it with extra DNNs using board-game-specific features during training. During evaluation, DS-MCTS checks this measure periodically and stops the search if the state is certain.
3 Background
The AlphaGo series of work [30, 32, 31, 27] are all MCTS-based reinforcement learning algorithms. These algorithms either assume the environment transition dynamics are known or learn the dynamics. Based on the dynamics, they use Monte-Carlo tree search (MCTS) as the policy improvement operator: taking in the current policy, MCTS returns a better policy via search. The systematic search allows MCTS-based RL algorithms to quickly improve the policy and to perform much better in settings where heavy reasoning is required.
3.1 MCTS
This part briefly introduces the MCTS method as implemented in reinforcement learning applications. As mentioned in the related work, modern MCTS-based RL algorithms include four stages in the search loop, namely selection, expansion, evaluation, and backpropagation.
MCTS takes in the current state and generates a policy after a search loop of N iterations. Here N is a constant number of iterations set by the designer, regarded as the total budget. In the selection stage of each iteration, an action is selected by maximizing the UCB score. Specifically, AlphaZero [31] and MuZero [27] are built on a variant of UCB, P-UCT [25], and have achieved great success on board games and Atari games. The P-UCT rule is given in Eq. (1):
a_k = \arg\max_{a \in \mathcal{A}} \left[ Q(s, a) + P(s, a) \cdot \frac{\sqrt{\sum_{b \in \mathcal{A}} N(s, b)}}{1 + N(s, a)} \left( c_1 + \log\!\left( \frac{\sum_{b \in \mathcal{A}} N(s, b) + c_2 + 1}{c_2} \right) \right) \right],    (1)
where k is the iteration index, A is the action set, Q(s, a) is the estimated Q-value, P(s, a) is the policy prior obtained from the neural networks, N(s, a) is the number of visits selecting action a from state s, and c_1, c_2 are hyper-parameters. The output of MCTS is the visitation count of each action at the root node. After N search iterations, the final policy π(s) is defined as the normalized root visitation distribution π_N(s), where π_k(s, a) = N(s, a) / Σ_{b∈A} N(s, b) = N(s, a)/k, a ∈ A. For simplicity, we sometimes write π_k in place of π_k(s). The detailed MCTS procedure is given in the Appendix. In our method, we propose to approximate the final policy π_N(s) with π̂_k(s), which we call the virtual expanded policy, through a new expansion method and a termination rule. In this way, the number of iterations in MCTS can be reduced from N to k.
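To make the selection rule concrete, the following minimal Python sketch scores each action at a node with the P-UCT formula of Eq. (1) and picks the argmax. This is an illustration written for this summary, not code from the authors' repository; the dictionary inputs and the default values of c1 and c2 (taken from common MuZero settings) are assumptions.

import math

def puct_select(actions, Q, P, N, c1=1.25, c2=19652):
    """Select an action by maximizing the P-UCT score of Eq. (1).

    actions: legal actions at the node
    Q[a]: current Q-value estimate of action a
    P[a]: policy prior of action a (from the network)
    N[a]: visit count of action a
    c1, c2: exploration hyper-parameters (illustrative defaults)
    """
    total_visits = sum(N[a] for a in actions)            # sum over b of N(s, b)
    scale = c1 + math.log((total_visits + c2 + 1) / c2)  # log term of Eq. (1)

    def score(a):
        exploration = P[a] * math.sqrt(total_visits) / (1 + N[a]) * scale
        return Q[a] + exploration

    return max(actions, key=score)

With few total visits the log term is close to c1, so early in the search the exploration bonus is governed mainly by the priors and the visit counts.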
3.2 Computation Requirement
Most of the computation in MCTS-based RL lies in the MCTS procedure. Each action taken by MCTS requires N neural network evaluations, where N is the constant number of iterations in the search loop. Traditional RL algorithms, such as PPO [28] or DQN [23], only need a single neural network evaluation per action. Thus, MCTS-based RL is roughly N times more computationally expensive than traditional RL algorithms. In practice, training a single Atari game needs 12 hours of computation on 40 TPUs [27]. The computational requirement is roughly two orders of magnitude more than that of traditional RL algorithms [28], although the final performance of MuZero is much better.
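The N-fold cost can be seen from the structure of the search loop itself: each iteration performs exactly one network evaluation. The simplified Python sketch below (written for this summary under stated assumptions, not the authors' code; the Node class, the dummy evaluation function, and the undiscounted value backup are all simplifications) reuses the puct_select helper from Section 3.1 and shows that producing a single action policy costs num_iterations forward passes.

import random

class Node:
    """Minimal search-tree node for illustration."""
    def __init__(self, prior=1.0):
        self.prior = prior        # P(s, a) from the network
        self.q_value = 0.0        # running Q(s, a) estimate
        self.visit_count = 0      # N(s, a)
        self.children = {}        # action -> Node

def run_mcts(root, evaluate, actions, num_iterations):
    """Search loop: each iteration calls the (expensive) `evaluate` function once,
    so one action decision costs `num_iterations` network evaluations."""
    for _ in range(num_iterations):
        node, path = root, [root]
        # Selection: descend with P-UCT until reaching an unexpanded node.
        while node.children:
            Q = {a: c.q_value for a, c in node.children.items()}
            P = {a: c.prior for a, c in node.children.items()}
            N = {a: c.visit_count for a, c in node.children.items()}
            a = puct_select(list(node.children), Q, P, N)  # helper from Section 3.1 sketch
            node = node.children[a]
            path.append(node)
        # Expansion + evaluation: the single network call of this iteration.
        value, priors = evaluate(node)
        node.children = {a: Node(prior=priors[a]) for a in actions}
        # Backpropagation: update Q and N along the search path (no rewards/discount here).
        for n in path:
            n.q_value = (n.visit_count * n.q_value + value) / (n.visit_count + 1)
            n.visit_count += 1
    total = sum(c.visit_count for c in root.children.values())
    return {a: c.visit_count / total for a, c in root.children.items()}

# Usage with a dummy evaluation function standing in for the neural network:
actions = [0, 1, 2]
dummy_evaluate = lambda node: (random.uniform(-1, 1), {a: 1 / len(actions) for a in actions})
policy = run_mcts(Node(), dummy_evaluate, actions, num_iterations=50)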
4 Method
We aim to spend more search time on harder states and less on easier states. Intuitively, humans know when to make a quick decision and when to make a slow one under different circumstances. Unfortunately, this situation-aware behavior is absent in current MCTS algorithms. Therefore, we propose an MCTS variant that terminates the search iterations adaptively. It consists of two components: a novel expansion method named virtual expansion, which estimates the final visitation distribution based on the current partial tree, and a termination rule that decides when to stop based on the hardness of the current state. We illustrate the adaptive mechanism through visualizations in Section 5.5.
4.1 Termination Rule
We propose to terminate the search loop early based on the current tree statistics. Intuitively, we no longer need to search further if we find that recent searches cause little change to the root visitation distribution. With this intuition in mind, we propose a simple modification to the MCTS search algorithm. As mentioned in Section 3.1, π_k(s) is the policy defined by the visitations of the root state at iteration k. Let Δ_s(i, j) be the L1 difference between π_i(s) and π_j(s), namely Δ_s(i, j) = ||π_i(s) − π_j(s)||_1. We then terminate the search loop when we have searched at least rN iterations and Δ_s(k, k/2) < ε, where ε is a tolerance hyper-parameter, r ∈ (0, 1) is the ratio of the minimum search budget, and N is the full number of search iterations. We show that under certain conditions, a bound on Δ_s(k, k/2) implies a bound on Δ_s(k, N), and Δ_s(k, N) measures the distance between the current policy π_k(s) and the oracle policy π_N(s). In this way, Δ_s(k, k/2) reflects the hardness of the state s. Consequently, once the gap is small enough, further search iterations are unnecessary.
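As a rough sketch of this rule (illustrative only, with hypothetical helper names; this is not the released implementation), the check reduces to an L1 distance between two snapshots of the root policy, gated by the minimum budget rN:

def l1_distance(pi_a, pi_b, actions):
    """Compute ||pi_a - pi_b||_1 over the action set."""
    return sum(abs(pi_a[a] - pi_b[a]) for a in actions)

def should_terminate(k, N, r, eps, pi_k, pi_half_k, actions):
    """Stop the search loop once at least r*N iterations are done and the root
    policy has changed little between iterations k/2 and k."""
    if k < r * N:                  # always spend the minimum search budget
        return False
    return l1_distance(pi_k, pi_half_k, actions) < eps

In the full method, the distance is evaluated on the virtual expanded policies π̂_k and π̂_{k/2} introduced in Section 4.2, which makes the two snapshots comparable.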
4.2 Virtual Expansion in MCTS
Algorithm 1 Iteration of vanilla MCTS
1: Current k-th iteration step:
2: Input: A, P, Q_k(s, a), N_k(s, a)
3: Initialize: s ← s_root
4: repeat                                  ▷ do search
5:     a ← UCB1(Q, P, N)
6:     s ← next state(s, a)
7: until N_k(s, a) = 0
8: Evaluate the state value R(s, a) and the prior P(s, a)
9: for s, a along the search path do
10:     Q_{k+1}(s, a) = (N_k(s, a) · Q_k(s, a) + R(s, a)) / (N_k(s, a) + 1)
11:     N_{k+1}(s, a) = N_k(s, a) + 1
12: end for
13: Return Q_{k+1}(s, a), N_{k+1}(s, a)

Algorithm 2 Iteration of MCTS with Virtual Expansion
1: Current k-th iteration step:
2: Input: A, P, Q_k(s, a), N_k(s, a), N̂_k(s, a)
3:
4: if N̂_k(s, a) not initialized then
5:     Init: N̂_k(s, a) ← N_k(s, a)
6: end if
7:
8: s ← s_root
9: a ← UCB1(Q, P, N̂)
10: N̂_k(s, a) ← N̂_k(s, a) + 1
11:
12: Return N̂_k(s, a)
The termination rule Δ_s(k, k/2) < ε assumes that π_i and π_j are directly comparable. However, they are not, because the tree is expanded with UCT: as the number of visits increases, the upper confidence bound becomes tighter, and later visits focus more on the promising parts. Thus earlier visitation distributions (smaller iteration numbers) are more exploratory, while later ones (larger iteration numbers) are more exploitative of the promising parts.
To compare π_i and π_j properly, we propose a method called virtual expansion in place of the vanilla expansion. Briefly, it aligns the two distributions by performing virtual UCT expansions up to the constant budget N. When the tree has been expanded for k iterations, it has N − k iterations to go. A normal expansion would require evaluating the neural network N − k more times to obtain a more accurate Q(s, a) estimate for each arm at the root node. Our proposed virtual expansion still expands N − k times according to UCT, but it skips the N − k neural network evaluations and assumes that each arm's Q(s, a) does not change. We denote the virtually expanded distribution obtained from π_i as the virtual expanded policy π̂_i. By doing virtual expansions on both π_i and π_j, we obtain the corresponding virtual expanded policies π̂_i, π̂_j. This effectively removes the different levels of exploration/exploitation in the two policies. The termination condition then becomes the difference between the virtual expanded policies. We name this rule the VET-Rule (Virtual Expanded Termination Rule):

\hat{\Delta}_s(k, k/2) = \left\| \hat{\pi}_k(s) - \hat{\pi}_{k/2}(s) \right\|_1 < \epsilon.    (2)
The comparison between vanilla expansion and virtual expansion is illustrated in Algorithms 1 and 2. The time-consuming computations appear in Algorithm 1: lines 4 to 7 search with UCT to reach an unvisited state for exploration; the algorithm then evaluates the state and backpropagates along the search path to better estimate the Q-values. After a total of N iterations, the visitation distribution of the root node π_N(s) is taken as the final policy π(s). In contrast, virtual expansion, listed in Algorithm 2, only searches one step from the root node and selects actions based on the current estimates without changing any properties of the search