Continuous Monte Carlo Graph Search
Kalle Kujanpää∗
Aalto University
Espoo, Finland
kalle.kujanpaa@aalto.fi
Amin Babadi∗
Bugbear Entertainment,
Aalto University
Helsinki, Finland
amin.babadi@bugbear.fi
Yi Zhao
Aalto University
Espoo, Finland
yi.zhao@aalto.fi
Juho Kannala
Aalto University
Espoo, Finland
juho.kannala@aalto.fi
Alexander Ilin
Aalto University, System 2 AI
Espoo, Finland
alexander.ilin@aalto.fi
Joni Pajarinen
Aalto University
Espoo, Finland
joni.pajarinen@aalto.fi
ABSTRACT
Online planning is crucial for high performance in many complex
sequential decision-making tasks. Monte Carlo Tree Search (MCTS)
employs a principled mechanism for trading off exploration and
exploitation for efficient online planning, and it outperforms com-
parison methods in many discrete decision-making domains such
as Go, Chess, and Shogi. Subsequently, extensions of MCTS to con-
tinuous domains have been developed. However, the inherent high
branching factor and the resulting explosion of the search tree
size are limiting the existing methods. To address this problem, we
propose Continuous Monte Carlo Graph Search (CMCGS), an exten-
sion of MCTS to online planning in environments with continuous
state and action spaces. CMCGS takes advantage of the insight that,
during planning, sharing the same action policy between several
states can yield high performance. To implement this idea, at each
time step, CMCGS clusters similar states into a limited number of
stochastic action bandit nodes, which produce a layered directed
graph instead of an MCTS search tree. Experimental evaluation
shows that CMCGS outperforms comparable planning methods in
several complex continuous DeepMind Control Suite benchmarks
and 2D navigation and exploration tasks with limited sample bud-
gets. Furthermore, CMCGS can be scaled up through parallelization,
and it outperforms the Cross-Entropy Method (CEM) in continuous
control with learned dynamics models.
KEYWORDS
Continuous Control; Planning; Reinforcement Learning; Model-
Based Reinforcement Learning; MCTS; Online Planning
ACM Reference Format:
Kalle Kujanpää, Amin Babadi, Yi Zhao, Juho Kannala, Alexander Ilin, and Joni
Pajarinen. 2024. Continuous Monte Carlo Graph Search. In Proc. of the 23rd
International Conference on Autonomous Agents and Multiagent Systems (AA-
MAS 2024), Auckland, New Zealand, May 6 – 10, 2024, IFAAMAS, 22 pages.
∗Equal Contribution
Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2024), N. Alechina, V. Dignum, M. Dastani, J.S. Sichman (eds.), May 6 – 10, 2024,
Auckland, New Zealand. © 2024 International Foundation for Autonomous Agents and
Multiagent Systems (www.ifaamas.org).
1 INTRODUCTION
Monte Carlo Tree Search (MCTS) is a well-known online planning
algorithm for solving decision-making problems in discrete action
spaces [15, 16]. MCTS achieves super-human performance in vari-
ous domains such as Atari, Go, Chess, and Shogi when a learned
transition model is available [61]. Therefore, recent research has
tried to extend MCTS to environments with continuous state and
action spaces [26, 38, 40, 42, 57].
However, current MCTS approaches for continuous states and
actions are limited [11, 14, 38, 57]. MCTS approaches that build the
search tree by discretizing the action space or otherwise limiting
the growth of the tree, such as progressive widening approaches [11,
14, 16], do not scale up well to high-dimensional action spaces and
complex problems due to the search space increasing exponentially
with the planning horizon and action dimensionality [9, 76]. To
alleviate this problem, learning-based approaches [38, 57] reduce
the required planning horizon when a sufficiently large number
of training samples is available but do not solve the underlying
problem of increasing search space size.
To solve the search space explosion problem in MCTS-based
continuous planning, this paper presents Continuous Monte Carlo
Graph Search (CMCGS), an extension of MCTS to the continuous
control problem. Similarly to MCTS, CMCGS employs an iterative
mechanism for building a search graph that yields the next action
to execute. At each iteration, a series of operators is used to grow
the search graph and update the information stored in the graph
nodes. CMCGS uses state clustering and Gaussian action bandits
to deal with the challenges posed by high-dimensional continuous
states and actions and limit the search space explosion.
Our experiments show that CMCGS outperforms strong MCTS
and other planning baselines in many challenging benchmarks
given limited interaction with the environment, including high-
dimensional environments from the DeepMind Control Suite (DMC)
[70]. We also show that CMCGS can be efficiently parallelized to
scale up and demonstrate its generality and robustness with learned
dynamics models in high-dimensional visual tasks, where CMCGS
outperforms the Cross-Entropy Method (CEM) [59, 73]. Moreover,
CMCGS shows strong performance by being superior to CEM in
environments that require complex exploration, and our algorithm
also performs well in difficult, high-dimensional environments with
informative rewards, where MCTS-based methods fail.
Figure 1: Core steps in one iteration of Continuous Monte Carlo Graph Search (CMCGS). a) Starting from the root node, the
graph is navigated via action sampling and node selection until a sink node is reached. b) If there is enough experience collected
in the nal layer of the graph and the maximum depth has not been reached, a new layer containing a new node N is initialized.
c) A trajectory of random actions is simulated from the graph’s sink node to approximate the value of the state. d) The computed
accumulated reward is backed up through the selected nodes, updating their replay memories, policies, and state distributions.
e) If a new cluster of experience data is found in a previous layer of the graph, all nodes in that layer are updated based on the
new clustering information (in this example, the node N is split into two new nodes N1 and N2).
2 RELATED WORK
Monte Carlo Tree Search (MCTS), a decision-time planning method
in discrete action spaces [9, 15, 41], is a core component of the
success of computer Go [22, 63, 64]. MCTS also shows
advantages in general game playing [3, 20, 23, 24]. Several recent
works combine MCTS with a learned dynamics model, lifting the
requirement for access to a simulator [61, 75]. Grouping similar
or equivalent states reached via different trajectories in MCTS has
been proposed [12], but recent methods mostly focus on the discrete
state setting [17, 43], which differs from our continuous setup.
Online planning in continuous action spaces. In continuous
action spaces, there can be an infinite number of actions, which
significantly increases the size of the MCTS search tree and makes
applying out-of-the-box MCTS infeasible. Instead, CEM is com-
monly used in the continuous action domain as an online planning
method [13, 25, 58, 59, 73]. Many methods combine CEM with a
value function or policy in different ways to improve its perfor-
mance [8, 30, 31, 45, 51]. The main limitation of CEM compared to
MCTS (and this work) is that CEM models the whole action trajec-
tory using a single sampling distribution. This corresponds, in the
context of MCTS, to using only one node at each layer of
the search tree, and it can limit the exploration capabilities of CEM
in environments where there are several ways of controlling the
agent, for example, going around an obstacle in two different
ways. Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
is an evolutionary gradient-free algorithm for black-box continu-
ous optimization [28, 29]. Although it has not seen much use in
continuous control due to its computational complexity and mem-
ory requirements, applying it as an alternative to CEM has been
proposed [18, 32, 50].
MCTS in continuous action spaces. Adopting MCTS in con-
tinuous action spaces requires mitigating the requirement of enu-
merating all actions as in the discrete case. Several works use pro-
gressive widening to increase the number of child actions of a node
as a function of its visitation frequency [11, 14, 16, 47]. The action
space can be split into bins by factorizing across action dimensions
[27, 67]. A line of work extends MCTS to continuous action spaces
via Hierarchical Optimistic Optimization (HOO) [10, 46, 49, 56].
HOO hierarchically partitions the action space by building a binary
tree to incrementally split the action space into smaller subspaces.
Voronoi Optimistic Optimization (VOO) builds on similar ideas
as HOO. It uses Voronoi partitioning to sample actions efficiently
and is guaranteed to converge [40]. VG-UCT relies on action-value
gradients to perform local policy improvement [42]. Furthermore,
several works grow the search tree based on sampling. Kernel re-
gression can be used to generalize the value estimation and, thus,
limit the number of actions sampled per node [76]. However, our
experiments indicate that despite the presented strategies, these
methods still struggle with the rapidly growing search space, es-
pecially with high-dimensional states and actions. We propose
tackling this challenge by representing the search space as a graph
instead of a tree, clustering similar states into search nodes, and
using Gaussian action bandits for the nodes. In practice, CMCGS
significantly improves on the continuous MCTS methods, especially
with limited search budgets.
State abstractions in MCTS. Our approach to clustering simi-
lar states into stochastic action bandit nodes is closely related to
state abstraction and aggregation methods. State abstraction helps
mitigate the impact of the combinatorial explosion in planning [1].
In state aggregation, many states are combined into abstract ones
[44, 71], and it has been proposed for handling the high branching
factor in stochastic environments [36]. Hierarchical MCTS through
action abstraction is closely related [6]. Progressive Abstraction
Refinement in Sparse Sampling (PARSS) builds and refines state
abstractions as the search progresses [37]. However, these methods,
including PARSS, generally assume a discrete action space, which is
incompatible with our setup. State abstractions can also be used in a
continuous setting when combined with progressive widening [65].
Our results indicate that relying on progressive widening makes the
proposed method uncompetitive with CMCGS. Elastic MCTS uses
state abstractions for strategy game playing, but lacks a progressive
widening or discretization strategy to handle the high-dimensional
continuous action spaces prevalent in our experiments [74].
There are also learning-based approaches to online planning
that represent the policy using a neural network and sample from
it when expanding the tree [2, 38]. By leveraging the learned policy,
these methods achieve promising performance with limited search
budgets. However, we focus only on the core search component and
leave comparisons with learning-based methods to future work.
3 PRELIMINARIES
Planning and search is arguably the most classic approach for opti-
mal control [19, 48, 69]. The idea is to start with a (rough) estimate
of the action trajectory and gradually improve it through model-
based simulation of the environment. In the case of environments
with complex dynamics, this is usually done online by optimizing
only a small part of the action trajectory at each step. Then, the
first action of the optimized trajectory is executed. After that, the
whole process is repeated, starting from the newly visited state. This
process is also known as closed-loop control, or model-predictive
control (MPC). MPC is an effective approach for optimal control in
complex real-time environments [5, 21, 60].
Monte Carlo Tree Search (MCTS) [15] is one of the most popular
MPC algorithms in environments such as video games [35, 53] that
require principled exploration. MCTS starts by initializing a single-
node search tree using the current state $s_t$ as the root and grows
the tree in an iterative manner, where one node is added to the tree
at each iteration. The iterative process of MCTS comprises the
following four key steps [9]:
(1) Selection (Tree Policy): In this step, the algorithm selects
one of the tree nodes to be expanded next. This is done by
starting from the root and navigating to a child node until a
node with at least one unexpanded child is visited using a
selection criterion, whose goal is to balance the exploration-
exploitation trade-o.
(2) Expansion: In this step, MCTS expands the search tree by
adding a (randomly chosen) unvisited child to the selected
node from the previous step.
(3) Simulation (Default Policy): After expanding the selected
node, the newly added node is evaluated by simulating a trajectory
of random actions and computing the return.
(4) Backup: In the final step, the computed return is backed up
through the navigated nodes, updating the statistics stored
in each node.
For a more detailed description of the MCTS algorithm, the
reader is referred to Browne et al. [9].
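To make the four steps concrete, the following minimal Python sketch outlines one iteration of a generic MCTS with a UCB1 tree policy; the class and function names (TreeNode, mcts_iteration, step, legal_actions, rollout) are illustrative placeholders for a discrete-action setting, not the interface of any particular MCTS implementation.

import math
import random


class TreeNode:
    """A generic MCTS node holding visit statistics and children (illustrative)."""

    def __init__(self, state, untried_actions, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried_actions = list(untried_actions)
        self.visits = 0
        self.value_sum = 0.0

    def ucb_score(self, c=1.4):
        # UCB1 selection criterion: mean value plus an exploration bonus.
        if self.visits == 0:
            return float("inf")
        return self.value_sum / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )


def mcts_iteration(root, step, legal_actions, rollout):
    """One iteration of the four MCTS steps.

    step(state, action) -> next_state, legal_actions(state) -> list of actions,
    and rollout(state) -> return are environment-specific callables.
    """
    # 1) Selection (tree policy): descend while the node is fully expanded.
    node = root
    while not node.untried_actions and node.children:
        node = max(node.children, key=lambda n: n.ucb_score())

    # 2) Expansion: add one randomly chosen unvisited child.
    if node.untried_actions:
        action = node.untried_actions.pop(random.randrange(len(node.untried_actions)))
        next_state = step(node.state, action)
        child = TreeNode(next_state, legal_actions(next_state), parent=node)
        node.children.append(child)
        node = child

    # 3) Simulation (default policy): evaluate the new node with a random rollout.
    value = rollout(node.state)

    # 4) Backup: propagate the computed return to the root.
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent

In CMCGS, described next, the tree and the per-node enumeration of discrete actions are replaced by a layered graph of state clusters with Gaussian action bandits.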
4 CONTINUOUS MONTE CARLO GRAPH
SEARCH
We propose an algorithm for continuous control called Continuous
Monte Carlo Graph Search (CMCGS). We assume a discrete-time,
discounted, and fully-observable Markov Decision Process (MDP)
$M = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ with continuous states and actions. $\mathcal{S}$ is the state
space, $\mathcal{A}$ the action space, $P$ the transition probability function,
$R$ the reward function, and $\gamma \in [0, 1]$ the discount factor. The
objective of CMCGS is to find the action $a \in \mathcal{A}$ that maximizes
the return in $M$ by performing closed-loop control with model-
based simulation of the environment. CMCGS is an iterative online
planning algorithm in which each iteration contains the same four
steps as MCTS with an additional width expansion step (see Fig. 1).
Instead of building a search tree, CMCGS performs a search on a
layered graph in which each layer corresponds to one step of the
underlying MDP. We assume that we have access to a dynamics
model $p(\mathbf{s}_{t+1}, r_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ that can either be an exact model of the
environment or a learned approximation.
The core idea of CMCGS is to use state clustering, Gaussian
action bandits, and a layered directed search graph to tackle the
challenges posed by search state explosion and high-dimensional
continuous states and actions. Each node $q$ in the $t$-th layer of the
CMCGS graph corresponds to a cluster of states visited at timestep $t$.
The node is characterized by its state distribution $p_q(\mathbf{s}_t)$ and policy
$\pi_q(\mathbf{a}_t)$, which describe the states assigned to this cluster and the
preferred actions taken from this cluster during planning. Both dis-
tributions $p_q(\mathbf{s}_t)$ and $\pi_q(\mathbf{a}_t)$ are modeled as Gaussian distributions
with diagonal covariance matrices. The policy $\pi_q(\mathbf{a}_t)$ is initialized
as a Gaussian $\mathcal{N}\big(\mu = \tfrac{1}{2}(\mathbf{a}_{\min} + \mathbf{a}_{\max}),\ \sigma = \tfrac{1}{2}(\mathbf{a}_{\max} - \mathbf{a}_{\min})\big)$. During
planning, the distributions are fit to the data in the replay buffer.
The replay buffer contains tuples $e = (\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, R, q)$, where $q$ is
the node to which state $\mathbf{s}_t$ belongs and $R$ is the observed trajectory
return. We denote by $\mathcal{D}_q$ all the tuples in the replay buffer that
correspond to node $q$ and by $Q_t$ the set of nodes in layer $t$.
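As an illustration of this bookkeeping, a minimal Python sketch of a CMCGS node is given below; the class and attribute names are our own shorthand rather than the authors' implementation, the initial state standard deviation is an arbitrary placeholder, and box-bounded actions with known a_min and a_max are assumed.

from dataclasses import dataclass, field
import numpy as np


@dataclass
class CMCGSNode:
    """One state cluster q in layer t: a diagonal-Gaussian state model and action bandit."""
    state_mean: np.ndarray      # mean of p_q(s_t)
    state_std: np.ndarray       # std of p_q(s_t), diagonal covariance
    action_mean: np.ndarray     # mean of pi_q(a_t)
    action_std: np.ndarray      # std of pi_q(a_t), diagonal covariance
    replay: list = field(default_factory=list)  # tuples (s_t, a_t, s_next, R)

    @classmethod
    def initial(cls, state, a_min, a_max):
        # Policy initialized to N(mu = (a_min + a_max)/2, sigma = (a_max - a_min)/2).
        state = np.asarray(state, dtype=float)
        return cls(
            state_mean=state,
            state_std=np.ones_like(state),  # placeholder initialization, not specified here
            action_mean=0.5 * (np.asarray(a_min) + np.asarray(a_max)),
            action_std=0.5 * (np.asarray(a_max) - np.asarray(a_min)),
            replay=[],
        )

    def state_log_prob(self, s):
        # Diagonal-Gaussian log-density used to route states to nodes during selection.
        var = self.state_std ** 2 + 1e-8
        return float(-0.5 * np.sum((s - self.state_mean) ** 2 / var + np.log(2 * np.pi * var)))

    def sample_action(self, rng):
        # Draw an exploratory action from the node's Gaussian bandit.
        return rng.normal(self.action_mean, self.action_std)

During planning, several visited states map to the same node, which is what lets CMCGS share one action bandit across similar states.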
The search graph is initialized with $d_{\text{init}}$ layers and one node per
layer. Initializing the graph with multiple layers improves the early
search efficiency. The CMCGS algorithm then iterates the following
steps (see Fig. 1 for a graphical illustration).
(a) Selection: The selection mechanism is repeatedly applied to
navigate through the search graph starting from the root node
until a sink node (a node without children) or a terminal state
of the MDP is reached (Fig. 1a). When navigating through the
graph, we use an epsilon-greedy-inspired policy: with probabil-
ity $\epsilon$, we sample an action from the node policy, the Gaussian
distribution $\pi_q(\mathbf{a})$, and with probability $1 - \epsilon$, we uniformly
sample one of the top actions in $\mathcal{D}_q$ (greedy action) and mod-
ify it with a small Gaussian noise $\epsilon_{\text{top}} \sim \mathcal{N}(0, E_{\text{top}} \cdot (\mathbf{a}_{\max} - \mathbf{a}_{\min}))$
for local improvement. This approach enables the agent to leverage
past experience efficiently while exploring new options, leading to
better performance on complex tasks. The current state $\mathbf{s}_t$ is updated
to $\mathbf{s}_{t+1}$ according to the dynamics model $p(\mathbf{s}_{t+1}, r_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$, and
the node in the next layer is selected by maximizing the probabil-
ity of the updated state according to the node state distributions:
$$q \leftarrow \arg\max_{q \in Q_{t+1}} p_q(\mathbf{s}_{t+1})$$
(b) Depth expansion: If the last layer of the graph has collected
enough experience (the number of samples is greater than
threshold $m$) and the maximum graph depth has not been
reached, we add a new layer to the graph (Fig. 1b). The new
layer contains a single node initially. Without the threshold $m$,
the depth of the search graph would grow too fast.
(c) Simulation: Similar to MCTS, we simulate a trajectory of
random actions starting from the sink node and compute
the trajectory return $R$ (Fig. 1c). The length of the rollout is con-
trolled by a hyperparameter $N_r$, but the rollout is interrupted if
a terminal state is encountered.
(d) Backup: We traverse the search graph backward and update
the node distributions $p_q(\mathbf{s})$ and $\pi_q(\mathbf{a})$ for all the nodes visited
in the selection procedure (Fig. 1d). The replay buffers $\mathcal{D}_q$ of
the visited nodes are updated. For each affected node $q$, the
Gaussian state distribution $p_q(\mathbf{s})$ is fitted to the states $\mathbf{s}$ in the
node replay buffer $\mathcal{D}_q$ such that its mean is equal to the state
mean and its standard deviation to the state standard deviation.
For the action distributions $\pi_q(\mathbf{a})$, we use a threshold-based
strategy to encourage sufficient exploration: the node policy
$\pi_q(\mathbf{a})$ is updated only if the replay buffer of the node $q$ contains
enough experience ($|\mathcal{D}_q| > m/2$). The dimension means of
$\pi_q(\mathbf{a})$ are fitted to the highest-scoring (elite) experiences in $\mathcal{D}_q$,
similarly to CEM. To update the dimension variances of $\pi_q(\mathbf{a})$,
we adopt Bayes' rule by defining a conjugate prior (inverse
gamma) on the variances, computing the posteriors using the
elite samples, and selecting the posterior means as the variances.
This prevents the variance from decaying too fast. Formally,
we assume that the dimension variances $\sigma^2$ of $\pi_q(\mathbf{a})$ follow the
inverse gamma distribution with prior hyperparameters $\alpha$ and
$\beta$. Given $n$ elite samples $(\mathbf{a}_{t,1}, \ldots, \mathbf{a}_{t,n})$ and the new dimension
means $\mu$ of $\pi_q(\mathbf{a})$, the posterior hyperparameters of the inverse
gamma distribution are
$$\alpha_{\text{posterior}} = \alpha_{\text{prior}} + \frac{n}{2}, \qquad \beta_{\text{posterior}} = \beta_{\text{prior}} + \frac{\sum_{i=1}^{n}(\mathbf{a}_{t,i} - \mu)^2}{2},$$
and the new dimension variances $\sigma^2$ of $\pi_q(\mathbf{a})$ are equal to the
means of the inverse gamma posterior distributions:
$$\sigma^2 = \frac{\beta_{\text{posterior}}}{\alpha_{\text{posterior}} - 1}. \qquad (1)$$
(A minimal numerical sketch of this variance update is given after this list of steps.)
(e) Width expansion: To prevent rapid search graph growth
and maintain efficiency, CMCGS computes the maximum de-
sired number $C_t$ of nodes in layer $t$ using a heuristic rule
$$C_t = \min\left(n_{\max},\ n_t / m\right),$$
where $n_t$ is the number of transitions collected at timestep $t$ and
stored in the replay buffer $\mathcal{D}$, $m$ the threshold hyperparameter,
and $n_{\max}$ the hard limit on the number of nodes. If $|Q_t|$, the
number of nodes in layer $t$, is smaller than $C_t$, we attempt to
cluster the states visited at timestep $t$ into $|Q_t| + 1$ clusters.
We use Agglomerative Clustering with Ward linkage in our
experiments due to its fast running time with the amount of
data needed by CMCGS [52, 72]. Using Euclidean distance as
a metric is feasible both in low-dimensional state spaces and
learned latent spaces. The clustering is approved if each cluster
contains at least $m/2$ states to prevent degenerate clusters that
are unlikely to be selected for states encountered in the future.
Then, a new node is added to the layer (Fig. 1e), all experience
tuples in the replay buffer are re-assigned to the corresponding
nodes, and the state and action distributions $p_q(\mathbf{s})$ and $\pi_q(\mathbf{a})$
of the affected nodes are updated. Otherwise, no new node is
created, and clustering is attempted again when the layer has
collected $m/2$ new samples to limit the number of calls to the
clustering algorithm and, therefore, speed up the algorithm. (The
clustering-approval check is also sketched after this list of steps.)
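To make steps (d) and (e) concrete, the following Python sketch (using NumPy and scikit-learn) illustrates the inverse-gamma variance update and the clustering-approval check. The function names, the prior values alpha_prior and beta_prior, the elite-selection details, and the integer-division form of the node cap are our illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import AgglomerativeClustering


def update_action_variance(elite_actions, new_mean, alpha_prior=2.0, beta_prior=0.1):
    """Per-dimension variance update of pi_q(a) via an inverse-gamma conjugate prior.

    elite_actions: array of shape (n, action_dim) with the elite samples from D_q.
    new_mean: array of shape (action_dim,) with the freshly fitted dimension means.
    The prior values alpha_prior and beta_prior are illustrative placeholders.
    """
    elite_actions = np.atleast_2d(np.asarray(elite_actions, dtype=float))
    n = elite_actions.shape[0]

    # Posterior hyperparameters; Eq. (1) then gives the posterior mean as the variance.
    alpha_post = alpha_prior + n / 2.0
    beta_post = beta_prior + 0.5 * np.sum((elite_actions - new_mean) ** 2, axis=0)

    # Posterior mean of the inverse-gamma distribution, defined for alpha_post > 1.
    return beta_post / (alpha_post - 1.0)


def try_split_layer(states, current_num_nodes, n_max, m):
    """Width expansion (step e): attempt to cluster layer states into |Q_t|+1 groups."""
    states = np.asarray(states, dtype=float).reshape(len(states), -1)
    c_t = min(n_max, len(states) // m)          # heuristic cap on the number of nodes
    if current_num_nodes >= c_t:
        return None
    labels = AgglomerativeClustering(
        n_clusters=current_num_nodes + 1, linkage="ward"
    ).fit_predict(states)
    # Approve the clustering only if every cluster has at least m/2 states.
    counts = np.bincount(labels, minlength=current_num_nodes + 1)
    return labels if counts.min() >= m / 2 else None


# Example: the variance fitted to five tightly grouped elite actions stays bounded
# away from zero because of the prior, which slows premature convergence.
rng = np.random.default_rng(0)
elites = rng.normal(loc=[0.3, -0.2], scale=0.05, size=(5, 2))
print(update_action_variance(elites, elites.mean(axis=0)))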
When the simulation budget has been exhausted, the agent’s
next action is determined by the experience stored at the root node.
If the search has access to a deterministic environment simulator,
the rst action of the trajectory with the highest return is chosen.
If searching with a learned dynamics model or the environment
is stochastic, the average of the top actions in the replay memory
of the root node is returned to prevent the exploitation of model
inaccuracies or individual samples. Each of the ve steps of CMCGS
can be performed eciently using batch operations, making it
possible to collect multiple trajectories in parallel. The pseudocode
of the CMCGS algorithm is shown in Algorithm 1.
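A minimal sketch of this final action selection, assuming each root replay entry stores a (first action, trajectory return) pair; the function and parameter names, including the number of averaged top actions, are illustrative placeholders rather than the paper's implementation:

import numpy as np


def get_best_action(root_replay, deterministic_simulator, top_k=10):
    """Pick the agent's next action from the root node's replay experience.

    root_replay: list of (first_action, trajectory_return) pairs gathered at the root.
    """
    ranked = sorted(root_replay, key=lambda item: item[1], reverse=True)
    if deterministic_simulator:
        # Exact simulator: commit to the first action of the best trajectory found.
        return ranked[0][0]
    # Learned model or stochastic environment: average the top actions to avoid
    # exploiting model inaccuracies or single lucky samples.
    top_actions = np.stack([action for action, _ in ranked[:top_k]])
    return top_actions.mean(axis=0)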
5 EXPERIMENTS
We run four sets of experiments to evaluate the performance, gen-
erality, and robustness of CMCGS. First, we present a motivating
example where CEM fails due to its inability to explore properly in
an environment where a multimodal representation of the action
distribution is useful. Second, we evaluate CMCGS in continuous
control tasks with limited sample budgets to demonstrate that it
outperforms relevant baselines in a variety of tasks with different
levels of exploration required. Third, we demonstrate that CMCGS
can be parallelized and can successfully utilize learned models for
large-scale planning in complex tasks by showing that it achieves
signicantly better planning performance than CEM in image-based
continuous control environments. Finally, we perform extensive
ablation studies to better understand the design choices of CMCGS.
5.1 Toy Environment
We design a toy environment where the agent samples $N$ actions
$a_1, \ldots, a_N$ sequentially and receives a basic reward of $r = 0.5$ if
the absolute value of each action is greater than 1. An additional
reward of 0.5 is given if the sign of all actions is the same. The
reward is formalized as
$$r = \begin{cases} 0.5, & \text{if } \forall i \in \{1, \ldots, N\}: |a_i| > 1 \\ 1, & \text{if } \forall i \in \{1, \ldots, N\}: |a_i| > 1 \ \wedge\ \forall i \in \{2, \ldots, N\}: \operatorname{sgn}(a_i) = \operatorname{sgn}(a_1) \end{cases}$$
All action bandits are initialized to $\mathcal{N}(0, 1)$ to eliminate the effect
of the initial action distribution on the performance, the action
space is equal to $\mathbb{R}$, and the state is equal to the y-coordinate of the
agent (see Fig. 2). Finding the basic reward of $r = 0.5$ is easy, but
performing principled exploration to find the additional component
is a challenge. Furthermore, being able to represent complex multi-
modal action distributions is a valuable asset in this environment.
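For illustration, the reward can be written as a short Python function; this sketch assumes the reward is zero when the basic condition is not met, which the formula above leaves implicit, and it is not the environment's actual implementation:

import numpy as np


def toy_reward(actions):
    """Reward of the toy environment for a completed episode of N actions."""
    actions = np.asarray(actions, dtype=float)
    if not np.all(np.abs(actions) > 1.0):
        return 0.0          # neither reward component is obtained (assumed default)
    if np.all(np.sign(actions) == np.sign(actions[0])):
        return 1.0          # basic reward plus the additional same-sign bonus
    return 0.5              # basic reward only


print(toy_reward([1.5, 1.2, 1.1, 1.3, 1.6]))   # 1.0
print(toy_reward([1.5, -1.2, 1.1, 1.3, 1.6]))  # 0.5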
Algorithm 1 Continuous Monte Carlo Graph Search
1: function CMCGS($\mathbf{s}_0$)
2:   $q_{\text{root}} \leftarrow$ Node($\mathbf{s}_0$, parent=None)
3:   Initialize search graph with $d_{\text{init}}$ layers and $q_{\text{root}}$ as the root.
4:   while within computational budget do
5:     $\tau, R, \mathbf{s}_d \leftarrow$ GraphPolicy($\mathbf{s}_0, q_{\text{root}}$)
6:     if $\mathbf{s}_d$ is not terminal then
7:       $R \leftarrow R +$ RandomRollout($\mathbf{s}_d$)
8:     Backup($\tau, R$)
9:   return GetBestAction($q_{\text{root}}$)
10: function GraphPolicy($\mathbf{s}, q$)
11:   $\tau \leftarrow \emptyset$
12:   $R \leftarrow 0$
13:   $d \leftarrow 1$
14:   $\phi \leftarrow 1$ if $\mathcal{U}(0,1) < \epsilon$, else $0$
15:   while $\mathbf{s}$ is not terminal and $d \le N_{\text{layers}}$ do
16:     $\mathbf{a} \sim \phi\, \pi_q(\mathbf{a}) + (1 - \phi)\, \pi_{q,\text{greedy}}(\mathbf{a})$
17:     Apply action $\mathbf{a}$, observe new state $\mathbf{s}'$, reward $r$
18:     $\tau \leftarrow \tau \cup \{(\mathbf{s}, \mathbf{a}, \mathbf{s}', q)\}$
19:     $R \leftarrow R + r$
20:     if $\mathbf{s}'$ is not terminal and $d = N_{\text{layers}}$ then
21:       Try adding a new layer with one node to the graph.
22:     if $\mathbf{s}'$ is terminal or $d = N_{\text{layers}}$ then
23:       break
24:     $q \leftarrow \arg\max_{q \in Q_{t+1}} p_q(\mathbf{s}')$
25:     $\mathbf{s} \leftarrow \mathbf{s}'$
26:     $d \leftarrow d + 1$
27:   return $\tau, R, \mathbf{s}$
28: function Backup($\tau, R$)
29:   for each $(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, q_t) \in \tau$ do
30:     BufferStore($\mathbf{s}_t, \mathbf{a}_t, R, \mathbf{s}_{t+1}, q_t$)
31:     Compute $C_t$, the desired number of clusters in layer $t$
32:     if $|Q_t| < C_t$ then
33:       Cluster states $\{\mathbf{s}_t\}$ in layer $t$ into $|Q_t| + 1$ clusters
34:       if clustering was successful then
35:         Replace nodes in layer $t$ with the new nodes
36:         for $q \in Q_t$ do
37:           Estimate state distribution $p_q(\mathbf{s})$
38:           Estimate policy $\pi_q(\mathbf{a})$
39:     if no new cluster found in layer $t$ then
40:       Update $p_q(\mathbf{s})$ and $\pi_q(\mathbf{a})$ of the visited node $q_t$
We let $N = 5$ and compare our method to random shooting and
CEM. The results are reported in Table 1, and the agents' behavior is
illustrated in Fig. 2. The results show that CEM struggles to explore
the environment properly, and CMCGS outperforms random shoot-
ing due to directed exploration. CEM sometimes fails to receive the
basic reward of 0.5 due to the inability to represent the multimodal
action distribution natural to this problem.
Table 1: The success rates of achieving the small and large
rewards and the average rewards achieved by different al-
gorithms in the toy environment, plus-minus two standard
errors.

Method            r ≥ 0.5   r = 1.0   Average Reward
CMCGS             1.00      0.99      0.995 ± 0.002
Random shooting   1.00      0.89      0.943 ± 0.010
CEM               0.74      0.57      0.655 ± 0.028
[Figure 2 panels, left to right: CMCGS (Success), CEM (Partial Failure), CEM (Failure)]
Figure 2: An illustration of exploration in the toy environ-
ment. For clarity, the actions have been clipped to be in the
range $[-1, 1]$. The agent, the chosen action, the explored tra-
jectories, and the different-sized rewards are represented by
the blue dot, red arrow, grey dashed lines, and green dots,
respectively. The search nodes of CMCGS and the correspond-
ing state and action means are illustrated with black dots
and arrows. The left image illustrates how CMCGS explores
with state-dependent policies. In the other pictures, we see
how CEM fails in the environment. CEM can either fail to
discover the large reward and choose a suboptimal action
(center) or completely fail to handle the multimodality re-
quired by the environment (right).
5.2 Continuous Control with Limited
Interaction
Second, we compare CMCGS to the strong planning algorithms
(1) Monte Carlo Tree Search with Progressive Widening [11, 14, 16,
MCTS-PW], (2) Voronoi Optimistic Optimization Applied to Trees
[40, VOOT], (3) MCTS with Iteratively Refined State Abstractions
[65, MCTS-ABS], (4) Cross-Entropy Method [58, CEM],
and (5) Covariance Matrix Adaptation Evolution Strategy [28, CMA-
ES] to demonstrate the power of CMCGS in continuous control
tasks when the number of environment interactions is a bottle-
neck. This can be the case in, for instance, real-life robotics or
when a computationally very intensive and accurate simulator is
used. In these tasks, we evaluate the planning performance of the
algorithms on real dynamics. We also include standard random
shooting (RS) as a baseline. More details about the baselines and
their implementations can be found in Appendix B.
The environments we use in these experiments include:
(1) 2D Navigation: We developed an environment for the con-
trol of a particle that should reach the goal (shown in green
in Fig. 3a) without colliding with the obstacles. We used
two variations of this environment, one with circular and
another with rectangular obstacles. In these environments, a