
State abstractions in MCTS. Our approach to clustering similar states into stochastic action bandit nodes is closely related to state abstraction and aggregation methods. State abstraction helps mitigate the impact of the combinatorial explosion in planning [1]. In state aggregation, many states are combined into abstract ones [44, 71], and it has been proposed for handling the high branching factor in stochastic environments [36]. Hierarchical MCTS through action abstraction is closely related [6]. Progressive Abstraction Refinement in Sparse Sampling (PARSS) builds and refines state abstractions as the search progresses [37]. However, these methods, including PARSS, generally assume a discrete action space, which is incompatible with our setup. State abstractions can also be used in a continuous setting when combined with progressive widening [65]. Our results indicate that relying on progressive widening makes the proposed method uncompetitive with CMCGS. Elastic MCTS uses state abstractions for strategy game playing, but lacks a progressive widening or discretization strategy to handle the high-dimensional continuous action spaces prevalent in our experiments [74].
There are also learning-based approaches to online planning that represent the policy using a neural network and sample from it when expanding the tree [2, 38]. By leveraging the learned policy, these methods achieve promising performance with limited search budgets. However, we focus only on the core search component and leave comparisons with learning-based methods to future work.
3 PRELIMINARIES
Planning and search is arguably the most classic approach for optimal control [19, 48, 69]. The idea is to start with a (rough) estimate of the action trajectory and gradually improve it through model-based simulation of the environment. In the case of environments with complex dynamics, this is usually done online by optimizing only a small part of the action trajectory at each step. Then, the first action of the optimized trajectory is executed. After that, the whole process is repeated, starting from the newly visited state. This process is also known as closed-loop control, or model-predictive control (MPC). MPC is an effective approach for optimal control in complex real-time environments [5, 21, 60].
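As an illustration, the receding-horizon loop described above can be sketched as follows; the `plan_trajectory` routine, the gym-style `env` interface, and the horizon length are illustrative placeholders rather than components of this work.

```python
def mpc_control(env, plan_trajectory, horizon=10, n_steps=100):
    """Minimal model-predictive control loop (illustrative sketch).

    plan_trajectory(state, horizon) is assumed to return a sequence of
    actions optimized through model-based simulation of the environment.
    """
    state = env.reset()
    for _ in range(n_steps):
        # Optimize only a short action trajectory from the current state.
        actions = plan_trajectory(state, horizon)
        # Execute the first action of the optimized trajectory ...
        state, reward, done, _ = env.step(actions[0])
        if done:
            break
        # ... then replan from the newly visited state (closed-loop control).
    return state
```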
Monte Carlo Tree Search (MCTS) [15] is one of the most popular MPC algorithms in environments such as video games [35, 53] that require principled exploration. MCTS starts by initializing a single-node search tree using the current state $s_t$ as the root and grows the tree in an iterative manner, where one node is added to the tree at each iteration. The iterative process of MCTS is comprised of the following four key steps [9]:
(1) Selection (Tree Policy): In this step, the algorithm selects one of the tree nodes to be expanded next. This is done by starting from the root and navigating to a child node until a node with at least one unexpanded child is visited, using a selection criterion whose goal is to balance the exploration-exploitation trade-off.
(2) Expansion: In this step, MCTS expands the search tree by adding a (randomly chosen) unvisited child to the selected node from the previous step.
(3) Simulation (Default Policy): After expanding the selected node, the newly added node is evaluated by rolling out a trajectory of random actions and computing the return.
(4) Backup: In the final step, the computed return is backed up through the navigated nodes, updating the statistics stored in each node.
For a more detailed description of the MCTS algorithm, the
reader is referred to Browne et al. [9].
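The four steps can be condensed into the following skeleton; the node attributes, `select_child`, and `rollout` are hypothetical helpers introduced only to make the control flow explicit and do not correspond to a specific implementation.

```python
def mcts_iteration(root, env_model, select_child, rollout):
    """One MCTS iteration (illustrative skeleton)."""
    # (1) Selection: descend from the root with the tree policy until a node
    #     with at least one unexpanded child is reached.
    node = root
    while node.fully_expanded() and node.children:
        node = select_child(node)

    # (2) Expansion: add a randomly chosen unvisited child to the tree.
    if not node.fully_expanded():
        node = node.expand_random_child(env_model)

    # (3) Simulation: evaluate the new node with a random-action rollout.
    ret = rollout(node.state, env_model)

    # (4) Backup: propagate the return through the navigated nodes.
    while node is not None:
        node.visits += 1
        node.value_sum += ret
        node = node.parent
```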
4 CONTINUOUS MONTE CARLO GRAPH
SEARCH
We propose an algorithm for continuous control called Continuous Monte Carlo Graph Search (CMCGS). We assume a discrete-time, discounted, and fully-observable Markov Decision Process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ with continuous states and actions. $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition probability function, $R$ the reward function, and $\gamma \in [0, 1]$ the discount factor. The objective of CMCGS is to find the action $a \in \mathcal{A}$ that maximizes the return in $M$ by performing closed-loop control with model-based simulation of the environment. CMCGS is an iterative online planning algorithm in which each iteration contains the same four steps as MCTS with an additional width expansion step (see Fig. 1). Instead of building a search tree, CMCGS performs a search on a layered graph in which each layer corresponds to one step of the underlying MDP. We assume that we have access to a dynamics model $p(\boldsymbol{s}_{t+1}, r_{t+1} \mid \boldsymbol{s}_t, \boldsymbol{a}_t)$ that can either be an exact model of the environment or a learned approximation.
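For concreteness, a minimal interface for such a dynamics model could look like the following sketch; the class name and method signature are assumptions made for illustration and not an API defined in this paper.

```python
from typing import Protocol, Tuple
import numpy as np

class DynamicsModel(Protocol):
    """Hypothetical interface for the model p(s_{t+1}, r_{t+1} | s_t, a_t)."""

    def step(self, state: np.ndarray, action: np.ndarray) -> Tuple[np.ndarray, float]:
        """Sample a next state and reward for the given state-action pair.

        The implementation may be an exact simulator of the environment
        or a learned approximation of its dynamics and reward.
        """
        ...
```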
The core idea of CMCGS is to use state clustering, Gaussian action bandits, and a layered directed search graph to tackle the challenges posed by search state explosion and high-dimensional continuous states and actions. Each node $q$ in the $t$-th layer of the CMCGS graph corresponds to a cluster of states visited at timestep $t$. The node is characterized by its state distribution $p_q(\boldsymbol{s}_t)$ and policy $\pi_q(\boldsymbol{a}_t)$, which describe the states assigned to this cluster and the preferred actions taken from this cluster during planning. Both distributions $p_q(\boldsymbol{s}_t)$ and $\pi_q(\boldsymbol{a}_t)$ are modeled as Gaussian distributions with diagonal covariance matrices. The policy $\pi_q(\boldsymbol{a}_t)$ is initialized as a Gaussian $\mathcal{N}\left(\mu = \tfrac{1}{2}(\boldsymbol{a}_{\min} + \boldsymbol{a}_{\max}),\ \sigma = \tfrac{1}{2}(\boldsymbol{a}_{\max} - \boldsymbol{a}_{\min})\right)$. During planning, the distributions are fit to the data in the replay buffer.
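A minimal sketch of such a node is given below, assuming diagonal Gaussians parameterized by mean and standard deviation and the replay buffer described next; the class and attribute names are illustrative only.

```python
import numpy as np

class GraphNode:
    """One CMCGS node: a cluster of states visited at a given timestep (sketch)."""

    def __init__(self, action_low, action_high, state_dim):
        # Policy pi_q(a_t): diagonal Gaussian initialized as
        # N(mu = (a_min + a_max) / 2, sigma = (a_max - a_min) / 2).
        self.policy_mean = 0.5 * (action_low + action_high)
        self.policy_std = 0.5 * (action_high - action_low)
        # State distribution p_q(s_t): diagonal Gaussian over the clustered states.
        self.state_mean = np.zeros(state_dim)
        self.state_std = np.ones(state_dim)

    def refit(self, states, actions):
        """Refit both diagonal Gaussians to this node's replay-buffer data D_q.

        The exact weighting or filtering of the data used in the paper is not
        reproduced here; this sketch uses plain empirical moments.
        """
        self.state_mean = states.mean(axis=0)
        self.state_std = states.std(axis=0) + 1e-6
        self.policy_mean = actions.mean(axis=0)
        self.policy_std = actions.std(axis=0) + 1e-6
```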
The replay buffer contains tuples $e = (\boldsymbol{s}_t, \boldsymbol{a}_t, \boldsymbol{s}_{t+1}, R, q)$, where $q$ is the node to which state $\boldsymbol{s}_t$ belongs and $R$ is the observed trajectory return. We denote by $\mathcal{D}_q$ all the tuples in the replay buffer that correspond to node $q$, and by $Q_t$ the set of nodes in layer $t$.
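The corresponding bookkeeping can be sketched as follows; the dataclass and the grouping helper are illustrative assumptions rather than the paper's data structures.

```python
from collections import defaultdict
from dataclasses import dataclass
import numpy as np

@dataclass
class Experience:
    """One replay-buffer tuple e = (s_t, a_t, s_{t+1}, R, q)."""
    state: np.ndarray
    action: np.ndarray
    next_state: np.ndarray
    trajectory_return: float  # the observed trajectory return R
    node_id: int              # the node q whose cluster contains s_t

def group_by_node(replay_buffer):
    """Return D_q for every node q, i.e., the tuples assigned to that node."""
    groups = defaultdict(list)
    for e in replay_buffer:
        groups[e.node_id].append(e)
    return groups
```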
The search graph is initialized with $d_{\text{init}}$ layers and one node per layer. Initializing the graph with multiple layers improves the early search efficiency. The CMCGS algorithm then iterates the following steps (see Fig. 1 for a graphical illustration).
(a) Selection: The selection mechanism is repeatedly applied to navigate through the search graph starting from the root node until a sink node (a node without children) or a terminal state of the MDP is reached (Fig. 1a). When navigating through the graph, we use an epsilon-greedy-inspired policy: with probability $\epsilon$, we sample an action from the node policy, the Gaussian distribution $\pi_q(\boldsymbol{a})$, and with probability $1 - \epsilon$, we uniformly sample one of the top actions in $\mathcal{D}_q$ (greedy action) and perturb it with small Gaussian noise for local improvement¹:

¹ This approach enables the agent to leverage past experience efficiently while exploring new options, leading to better performance on complex tasks.
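A sketch of this epsilon-greedy-inspired action selection at a single node is given below, reusing the node and buffer structures sketched earlier; the criterion for the top actions (here, ranking by trajectory return), the top-action fraction, and the noise scale are assumptions for illustration rather than the paper's hyperparameters.

```python
import numpy as np

def select_action(node, node_buffer, eps=0.5, top_frac=0.2, noise_scale=0.05, rng=None):
    """Epsilon-greedy-inspired action selection at one CMCGS node (sketch)."""
    rng = rng or np.random.default_rng()
    if not node_buffer or rng.random() < eps:
        # With probability eps: sample from the node policy pi_q(a),
        # a Gaussian with diagonal covariance.
        return rng.normal(node.policy_mean, node.policy_std)
    # With probability 1 - eps: uniformly sample one of the top actions in D_q
    # and perturb it with small Gaussian noise for local improvement.
    ranked = sorted(node_buffer, key=lambda e: e.trajectory_return, reverse=True)
    k = max(1, int(top_frac * len(ranked)))
    greedy = ranked[rng.integers(k)].action
    return greedy + rng.normal(0.0, noise_scale, size=greedy.shape)
```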