Spending Thinking Time Wisely: Accelerating MCTS
with Virtual Expansions
Weirui Ye, Pieter Abbeel, Yang Gao
Tsinghua University, UC Berkeley, Shanghai Qi Zhi Institute
Abstract
One of the most important AI research questions is how to trade off computation versus performance, since "perfect rationality" exists in theory but is impossible to achieve in practice. Recently, Monte-Carlo tree search (MCTS) has attracted considerable attention due to its significant performance improvements in various challenging domains. However, the expensive time cost during search severely restricts its scope of application. This paper proposes Virtual MCTS (V-MCTS), a variant of MCTS that adaptively spends more search time on harder states and less search time on simpler states. We give theoretical bounds for the proposed method and evaluate its performance and computation on 9×9 Go board games and Atari games. Experiments show that our method can achieve performance comparable to the original search algorithm while requiring less than 50% of the search time on average. We believe that this approach is a viable alternative for tasks under limited time and resources. The code is available at https://github.com/YeWR/V-MCTS.git.
1 Introduction
When artificial intelligence was first studied in the 1950s, researchers sought to answer the question "How to build an agent with perfect rationality?". The term "perfect rationality" [7, 24, 26] here refers to decisions made with infinite amounts of computation. However, one can only solve small-scale problems without considering the practical computation time, since classical search algorithms usually exhibit exponential running time. Therefore, recent AI research no longer seeks "perfect rationality" but instead carefully trades off computation versus the level of rationality. Computational models such as "bounded optimality" have been developed to formalize these settings [26]. Increasing the level of rationality under the same computational budget has given us many AI successes; examples include Monte-Carlo sampling algorithms, variational inference algorithms, and the use of DNNs as universal function approximators [9, 8, 13, 30, 17].
Recently, MCTS-based RL algorithms have achieved much success, mainly on board games. The most notable achievement is that AlphaGo beat Fan Hui in 2015 [30], the first time a computer program beat a human professional Go player. Afterward, AlphaGo beat two top-ranking human players, Lee Sedol in 2016 and Ke Jie in 2017, the latter of whom was ranked first worldwide at the time. Later, MCTS-based RL algorithms were further extended to other board games and Atari games [27]. EfficientZero [34] significantly improves the sample efficiency of MCTS-based RL algorithms, shedding light on future applications in the real world such as robotics and self-driving.

Despite the impressive performance of MCTS-based RL algorithms, they require massive amounts of computation to train and evaluate. For example, MuZero [27] used 1000 TPUs for 12 hours to learn the game of Go, and a single Atari game needs 40 TPUs for 12 hours of training. Compared
to previous algorithms on the Atari games benchmark, it needs around two orders of magnitude more compute. This prohibitively large computational requirement has slowed down both the further development of MCTS-based RL algorithms and their practical use.
Under the hood, MCTS-based RL algorithms imagine possible futures by considering different future action sequences. However, this imagining process is not computationally efficient in current methods. For example, AlphaGo needs to look ahead through 1600 game states to place a single stone. By contrast, top human professional players can only think through around 100-200 game states per minute [30]. Apart from this inefficiency, the current MCTS algorithm deals with easy and challenging cases with the same computational budget, whereas humans know to spend their time when it is most needed.
In this paper, we aim to design new algorithms that save the computational time of MCTS-based RL methods. We make three key contributions: (1) We present Virtual MCTS, a variant of MCTS, to approximate the vanilla MCTS search policies with less computation. Unlike previous pruning-based methods that focus on the selection or evaluation stage of MCTS, our method improves the search loop itself, terminating the search iterations earlier and adaptively when the current state is simpler; (2) Theoretically, we provide error bounds for the proposed method. Furthermore, visualization results indicate that Virtual MCTS has a better computation-performance trade-off than vanilla MCTS; (3) Empirically, our method saves more than 50% of the search time on the challenging game of 9×9 Go and more than 60% on visually complex Atari games while maintaining performance comparable to vanilla MCTS.
2 Related Work
2.1 Reinforcement Learning with MCTS
For a long time, Computer Go was regarded as a remarkably challenging game [3, 6]. Researchers attempted to use Monte-Carlo techniques that evaluate the value of a node state through random playouts [4, 11, 12, 30]. Afterward, UCT algorithms were widely applied in Monte-Carlo tree search (MCTS), using UCB1 to select the action at each node of the tree [20]. Recently, MCTS-based RL methods [30, 32, 31, 27] have become increasingly popular and have achieved super-human performance on board games because of their strong search ability.
Modern MCTS-based RL algorithms include four stages in the search loop: selection, expansion, evaluation, and backpropagation. The computation bottlenecks in vanilla MCTS come from the search loop, especially the evaluation stage and the selection stage of each iteration. The selection stage is time-consuming when the search tree becomes wider and deeper. The evaluation stage is quite expensive because earlier research evaluated node values by running random playouts to the end. Due to the search loop, MCTS-based algorithms require many more model inferences than model-free RL methods such as PPO [28] and SAC [16].
2.2 Acceleration of MCTS
MCTS-based methods have proven their strong capability for solving complex games and tasks. However, the high computational cost of MCTS hinders its application to real-time and more general scenarios. Therefore, numerous works are devoted to accelerating MCTS. For example, to make the selection stage more effective, some heuristic pruning methods [14, 33, 29, 1, 2] aim to reduce the width and depth of the search tree with heuristic functions. Furthermore, for more efficient evaluations, Lorentz [22] proposed early playout termination of MCTS (MCTS-EPT), which stops the random playouts early and uses an evaluation function to assess win or loss. Hsueh et al. [18] applied MCTS-EPT to Chinese dark chess and demonstrated its effectiveness. Afterward, similar ideas were applied in the evaluation stage of AlphaGo Zero [32] and later MCTS-based methods [31, 27, 34]; they evaluate the Q-values through a learnable evaluation network instead of running playouts to the end. Grill et al. [15] propose a novel regularized policy optimization method based on AlphaZero that decreases the search budget of MCTS from an optimization perspective. Danihelka et al. [10] propose a policy improvement algorithm based on sampling actions without replacement, named the Gumbel trick, to achieve better performance when planning with few simulations.
However, the methods mentioned above focus on specific stages of the search iteration or reduce the total budget through pruning and optimization, which is orthogonal to our approach, and few works target the search loop itself. Lan et al. [21] propose DS-MCTS, which defines an uncertainty measure of MCTS and approximates it with extra DNNs using board-game-specific features during training. During evaluation, DS-MCTS checks this measure periodically and stops the search if the state is certain.
3 Background
The AlphaGo series of work [30, 32, 31, 27] are all MCTS-based reinforcement learning algorithms. These algorithms either assume the environment transition dynamics are known or learn the dynamics. Based on the dynamics, they use Monte-Carlo tree search (MCTS) as the policy improvement operator: taking in the current policy, MCTS returns a better policy via search. The systematic search allows MCTS-based RL algorithms to quickly improve the policy and to perform much better in settings where heavy reasoning is required.
3.1 MCTS
This part briefly introduces the MCTS method as implemented in reinforcement learning applications. As mentioned in the related work, modern MCTS-based RL algorithms include four stages in the search loop, namely selection, expansion, evaluation, and backpropagation.
MCTS takes in the current state and generates a policy after a search loop of N iterations. Here N is a constant number of iterations set by the designer, regarded as the total budget. In the selection stage of each iteration, an action is selected by maximizing the UCB score. Specifically, AlphaZero [31] and MuZero [27] are built on a variant of UCB, P-UCT [25], and have achieved great success on board games and Atari games. The P-UCT rule is given in Eq. (1):
a_k = \arg\max_{a \in \mathcal{A}} \left[ Q(s, a) + P(s, a) \cdot \frac{\sqrt{\sum_{b \in \mathcal{A}} N(s, b)}}{1 + N(s, a)} \left( c_1 + \log\!\left( \frac{\sum_{b \in \mathcal{A}} N(s, b) + c_2 + 1}{c_2} \right) \right) \right],    (1)
where k is the iteration index, A is the action set, Q(s, a) is the estimated Q-value, P(s, a) is the policy prior obtained from the neural networks, N(s, a) is the number of visits selecting action a from state s, and c_1, c_2 are hyper-parameters. The output of MCTS is the visitation count of each action at the root node. After N search iterations, the final policy π(s) is defined as the normalized root visitation distribution π_N(s), where π_k(s, a) = N(s, a) / Σ_{b∈A} N(s, b) = N(s, a)/k, a ∈ A. For simplicity, we sometimes write π_k in place of π_k(s). The detailed MCTS procedure is given in the Appendix. In our method, we propose to approximate the final policy π_N(s) with π̂_k(s), which we call the virtual expanded policy, through a new expansion method and a termination rule. In this way, the number of iterations in MCTS can be reduced from N to k.
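To make the selection rule concrete, the following minimal Python sketch scores each action at a node with the P-UCT formula of Eq. (1) and picks the argmax. This is an illustration written for this summary, not code from the authors' repository; the dictionary inputs and the default values of c1 and c2 (taken from common MuZero settings) are assumptions.

import math

def puct_select(actions, Q, P, N, c1=1.25, c2=19652):
    """Select an action by maximizing the P-UCT score of Eq. (1).

    actions: legal actions at the node
    Q[a]: current Q-value estimate of action a
    P[a]: policy prior of action a (from the network)
    N[a]: visit count of action a
    c1, c2: exploration hyper-parameters (illustrative defaults)
    """
    total_visits = sum(N[a] for a in actions)            # sum over b of N(s, b)
    scale = c1 + math.log((total_visits + c2 + 1) / c2)  # log term of Eq. (1)

    def score(a):
        exploration = P[a] * math.sqrt(total_visits) / (1 + N[a]) * scale
        return Q[a] + exploration

    return max(actions, key=score)

With few total visits the log term is close to c1, so early in the search the exploration bonus is governed mainly by the priors and the visit counts.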
3.2 Computation Requirement
Most of the computation in MCTS-based RL lies in the MCTS procedure. Each action taken by MCTS requires N neural network evaluations, where N is the constant number of iterations in the search loop. Traditional RL algorithms, such as PPO [28] or DQN [23], only need a single neural network evaluation per action. Thus, MCTS-based RL is roughly N times more computationally expensive than traditional RL algorithms. In practice, training a single Atari game needs 12 hours of computation on 40 TPUs [27]. The computational requirement is roughly two orders of magnitude more than that of traditional RL algorithms [28], although the final performance of MuZero is much better.
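The N-fold cost can be seen from the structure of the search loop itself: each iteration performs exactly one network evaluation. The simplified Python sketch below (written for this summary under stated assumptions, not the authors' code; the Node class, the dummy evaluation function, and the undiscounted value backup are all simplifications) reuses the puct_select helper from Section 3.1 and shows that producing a single action policy costs num_iterations forward passes.

import random

class Node:
    """Minimal search-tree node for illustration."""
    def __init__(self, prior=1.0):
        self.prior = prior        # P(s, a) from the network
        self.q_value = 0.0        # running Q(s, a) estimate
        self.visit_count = 0      # N(s, a)
        self.children = {}        # action -> Node

def run_mcts(root, evaluate, actions, num_iterations):
    """Search loop: each iteration calls the (expensive) `evaluate` function once,
    so one action decision costs `num_iterations` network evaluations."""
    for _ in range(num_iterations):
        node, path = root, [root]
        # Selection: descend with P-UCT until reaching an unexpanded node.
        while node.children:
            Q = {a: c.q_value for a, c in node.children.items()}
            P = {a: c.prior for a, c in node.children.items()}
            N = {a: c.visit_count for a, c in node.children.items()}
            a = puct_select(list(node.children), Q, P, N)  # helper from Section 3.1 sketch
            node = node.children[a]
            path.append(node)
        # Expansion + evaluation: the single network call of this iteration.
        value, priors = evaluate(node)
        node.children = {a: Node(prior=priors[a]) for a in actions}
        # Backpropagation: update Q and N along the search path (no rewards/discount here).
        for n in path:
            n.q_value = (n.visit_count * n.q_value + value) / (n.visit_count + 1)
            n.visit_count += 1
    total = sum(c.visit_count for c in root.children.values())
    return {a: c.visit_count / total for a, c in root.children.items()}

# Usage with a dummy evaluation function standing in for the neural network:
actions = [0, 1, 2]
dummy_evaluate = lambda node: (random.uniform(-1, 1), {a: 1 / len(actions) for a in actions})
policy = run_mcts(Node(), dummy_evaluate, actions, num_iterations=50)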
4 Method
We aim to spend more search time on harder states and less on easier states. Intuitively, humans know when to make a quick decision and when to make a slow one under different circumstances. Unfortunately, this situation-aware behavior is absent in current MCTS algorithms. Therefore, we propose an MCTS variant that terminates the search iterations adaptively. It consists of two components: a novel expansion method named virtual expansion, which estimates the final visitation distribution based on the current partial tree, and a termination rule that decides when to stop based on the hardness of the current state. We illustrate the adaptive mechanism through visualizations in Section 5.5.
4.1 Termination Rule
We propose to terminate the search loop early based on the current tree statistics. Intuitively, we no longer need to search further if we find that recent searches cause little change to the root visitation distribution. With this intuition in mind, we propose a simple modification to the MCTS search algorithm. As mentioned in Section 3.1, π_k(s) is the policy defined by the visitations of the root state at iteration k. Let Δ_s(i, j) be the L1 difference between π_i(s) and π_j(s), namely Δ_s(i, j) = ||π_i(s) − π_j(s)||_1. We then terminate the search loop when we have searched at least rN iterations and Δ_s(k, k/2) < ε, where ε is a tolerance hyper-parameter, r ∈ (0, 1) is the ratio of the minimum search budget, and N is the full number of search iterations. We show that under certain conditions, a bound on Δ_s(k, k/2) implies a bound on Δ_s(k, N), and Δ_s(k, N) measures the distance between the current policy π_k(s) and the oracle policy π_N(s). In this way, Δ_s(k, k/2) reflects the hardness of the state s. Consequently, once the gap is small enough, further search iterations are unnecessary.
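As a rough sketch of this rule (illustrative only, with hypothetical helper names; this is not the released implementation), the check reduces to an L1 distance between two snapshots of the root policy, gated by the minimum budget rN:

def l1_distance(pi_a, pi_b, actions):
    """Compute ||pi_a - pi_b||_1 over the action set."""
    return sum(abs(pi_a[a] - pi_b[a]) for a in actions)

def should_terminate(k, N, r, eps, pi_k, pi_half_k, actions):
    """Stop the search loop once at least r*N iterations are done and the root
    policy has changed little between iterations k/2 and k."""
    if k < r * N:                  # always spend the minimum search budget
        return False
    return l1_distance(pi_k, pi_half_k, actions) < eps

In the full method, the distance is evaluated on the virtual expanded policies π̂_k and π̂_{k/2} introduced in Section 4.2, which makes the two snapshots comparable.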
4.2 Virtual Expansion in MCTS
Algorithm 1 Iteration of vanilla MCTS
1: Current k-th iteration step:
2: Input: A, P, Q_k(s, a), N_k(s, a)
3: Initialize: s ← s_root
4: repeat                                  ▷ do search
5:     a ← UCB1(Q, P, N)
6:     s ← next state(s, a)
7: until N_k(s, a) = 0
8: Evaluate the state value R(s, a) and the prior P(s, a)
9: for s, a along the search path do
10:     Q_{k+1}(s, a) = (N_k(s, a) · Q_k(s, a) + R(s, a)) / (N_k(s, a) + 1)
11:     N_{k+1}(s, a) = N_k(s, a) + 1
12: end for
13: Return Q_{k+1}(s, a), N_{k+1}(s, a)

Algorithm 2 Iteration of MCTS with Virtual Expansion
1: Current k-th iteration step:
2: Input: A, P, Q_k(s, a), N_k(s, a), N̂_k(s, a)
3:
4: if N̂_k(s, a) not initialized then
5:     Init: N̂_k(s, a) ← N_k(s, a)
6: end if
7:
8: s ← s_root
9: a ← UCB1(Q, P, N̂)
10: N̂_k(s, a) ← N̂_k(s, a) + 1
11:
12: Return N̂_k(s, a)
The termination rule Δ_s(k, k/2) < ε assumes that π_i and π_j are directly comparable. However, they are not, because the tree is expanded with UCT: as the number of visits increases, the upper confidence bound becomes tighter, and later visits focus more on the promising parts. Thus earlier visitation distributions (smaller iteration numbers) are more exploratory, while later ones (larger iteration numbers) are more exploitative of the promising parts.
To compare π_i and π_j properly, we propose a method called virtual expansion in place of the vanilla expansion. Briefly, it aligns the two distributions by performing virtual UCT expansions up to the constant budget N. When the tree has been expanded for k iterations, it has N − k iterations to go. A normal expansion would require evaluating the neural network N − k more times to obtain a more accurate Q(s, a) estimate for each arm at the root node. Our proposed virtual expansion still expands N − k times according to UCT, but it skips the N − k neural network evaluations and assumes that each arm's Q(s, a) does not change. We denote the virtually expanded distribution obtained from π_i as the virtual expanded policy π̂_i. By doing virtual expansions on both π_i and π_j, we obtain the corresponding virtual expanded policies π̂_i, π̂_j. This effectively removes the different levels of exploration/exploitation in the two policies. The termination condition then becomes the difference between the virtual expanded policies. We name this rule the VET-Rule (Virtual Expanded Termination Rule):

\hat{\Delta}_s(k, k/2) = \left\| \hat{\pi}_k(s) - \hat{\pi}_{k/2}(s) \right\|_1 < \epsilon.    (2)
The comparison between vanilla expansion and virtual expansion is illustrated in Algorithms 1 and 2. The time-consuming computations appear in Algorithm 1: lines 4 to 7 search with UCT to reach an unvisited state for exploration; the algorithm then evaluates the state and backpropagates along the search path to better estimate the Q-values. After a total of N iterations, the visitation distribution of the root node π_N(s) is taken as the final policy π(s). In contrast, virtual expansion, listed in Algorithm 2, only searches one step from the root node and selects actions based on the current estimates without changing any properties of the search