to previous algorithms on the Atari games benchmark, it needs around two orders of magnitude
more compute. This prohibitively large computational requirement has slowed down both the further
development of MCTS-based RL algorithms and their practical use.
Under the hood, MCTS-based RL algorithms imagine possible futures by considering different action
sequences. However, this imagination process is not computationally efficient in current methods.
For example, AlphaGo needs to look ahead 1600 game states to place a single stone. In contrast,
top human professional players can only think through around 100-200 game states per minute [30].
Apart from this inefficiency, current MCTS algorithms handle easy and challenging cases with
the same computational budget, whereas humans know to spend their time where it is most needed.
In this paper, we aim to design new algorithms that reduce the computation time of MCTS-based
RL methods. We make three key contributions: (1) We present Virtual MCTS, a variant of
MCTS, to approximate the vanilla MCTS search policies with less computation. Unlike
previous pruning-based methods that focus on the selection or evaluation stage of MCTS, our method
improves the search loop itself: it adaptively terminates the search iterations earlier when the current
state is simpler; (2) Theoretically, we provide error bounds for the proposed method. Furthermore, the
visualization results indicate that Virtual MCTS has a better computation-performance trade-off
than vanilla MCTS; (3) Empirically, our method saves more than 50% of the search time on the
challenging game Go 9×9 and more than 60% on the visually complex Atari games while maintaining
performance comparable to vanilla MCTS.
2 Related Work
2.1 Reinforcement Learning with MCTS
For a long time, Computer Go has been regarded as a remarkably challenging game [3, 6]. Researchers
first attempted to use Monte-Carlo techniques that evaluate the value of a node state through random
playouts [4, 11, 12, 30]. Afterward, the UCT algorithm became widely adopted in Monte-Carlo
tree search (MCTS), using UCB1 to select actions at each node of the tree [20].
Recently, MCTS-based RL methods [30, 32, 31, 27] have become increasingly popular and have achieved
super-human performance on board games because of their strong search capability.
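For reference, the UCB1 rule balances exploitation and exploration at each node; written here in generic notation (the symbols below are ours, not taken from the cited works), the selected action is
\[
a^{*} = \arg\max_{a} \left( \bar{Q}(s, a) + C \sqrt{\frac{\ln N(s)}{N(s, a)}} \right),
\]
where $\bar{Q}(s,a)$ is the mean return of taking action $a$ at state $s$, $N(s)$ and $N(s,a)$ are visit counts, and $C$ is an exploration constant.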
Modern MCTS-based RL algorithms include four stages in the search loop: selection, expansion,
evaluation, and backpropagation. The computation bottlenecks of vanilla MCTS come from this
search loop, especially the evaluation and selection stages of each iteration. The selection
stage becomes time-consuming as the search tree grows wider and deeper. The evaluation stage is
quite expensive because previous work evaluates the node value by random playouts to the end of the
game. Due to the search loop, MCTS-based algorithms require many more model inferences than
model-free RL methods such as PPO [28] and SAC [16].
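As a rough, self-contained sketch (not the implementation of any specific agent discussed here), the four stages can be written as follows; the environment interface (`legal_actions`, `step`, `rollout_value`) is a hypothetical placeholder used only to make the loop concrete:

```python
import math

# Generic MCTS sketch illustrating the four stages of the search loop:
# selection, expansion, evaluation, and backpropagation.

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}       # action -> Node
        self.visit_count = 0
        self.value_sum = 0.0

    def q(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def ucb1(parent, child, c=1.4):
    # Exploitation term plus exploration bonus (+1 avoids division by zero).
    return child.q() + c * math.sqrt(
        math.log(parent.visit_count + 1) / (child.visit_count + 1))

def mcts_search(root, env, num_simulations):
    for _ in range(num_simulations):
        node, path = root, [root]
        # Selection: follow UCB1 scores until reaching a leaf node.
        while node.children:
            parent = node
            _, node = max(parent.children.items(),
                          key=lambda kv: ucb1(parent, kv[1]))
            path.append(node)
        # Expansion: add one child per legal action of the leaf state.
        for action in env.legal_actions(node.state):
            node.children[action] = Node(env.step(node.state, action))
        # Evaluation: a random playout here; modern MCTS-based RL methods
        # replace this with a learned value network.
        value = env.rollout_value(node.state)
        # Backpropagation: update statistics along the visited path.
        for n in path:
            n.visit_count += 1
            n.value_sum += value
    return root
```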
2.2 Acceleration of MCTS
MCTS-based methods have demonstrated a strong capability for solving complex games and tasks.
However, the high computational cost of MCTS hinders its application to real-time and more
general scenarios. Therefore, numerous works are devoted to accelerating MCTS. For example, to
make the selection stage more effective, some heuristic pruning methods [14, 33, 29, 1, 2] aim to
reduce the width and depth of the search tree with heuristic functions. Furthermore, for more
efficient evaluations, Lorentz [22] proposed early playout termination of MCTS (MCTS-EPT), which stops
the random playouts early and uses an evaluation function to assess wins or losses. Moreover, Hsueh et al.
[18] applied MCTS-EPT to Chinese dark chess and demonstrated its effectiveness. Afterward, similar
ideas have been applied in the evaluation stage of AlphaGoZero [32] and later MCTS-based methods
[31, 27, 34], which evaluate the Q-values through a learnable evaluation network instead of running
playouts to the end.
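Relative to playout-based evaluation, this amounts to a single forward pass of a learned model; a minimal sketch, where `value_net` is a hypothetical learned evaluator:

```python
# Evaluation via a learned value network instead of a random playout;
# `value_net` is a hypothetical evaluator mapping states to scalar values.
def evaluate(node, value_net):
    # One forward pass replaces a full playout to the end of the game.
    return value_net(node.state)
```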
Grill et al. [15] propose a regularized policy optimization method based
on AlphaZero that reduces the search budget of MCTS from an optimization perspective.
Danihelka et al. [10] propose a policy improvement algorithm based on sampling actions without
replacement, named the Gumbel trick, which achieves better performance when planning with few simulations.
However, the methods mentioned above focus on specific stages of the search iteration or reduce
the total budget through pruning and optimization, which is orthogonal to our approach; few
works target the search loop itself. Lan et al. [21] propose DS-MCTS, which defines the uncertainty of