
State abstractions in MCTS. Our approach to clustering similar states into stochastic action bandit nodes is closely related to state abstraction and aggregation methods. State abstraction helps mitigate the impact of the combinatorial explosion in planning [1]. In state aggregation, many states are combined into abstract ones [44, 71], and it has been proposed for handling the high branching factor in stochastic environments [36]. Hierarchical MCTS through action abstraction is closely related [6]. Progressive Abstraction Refinement in Sparse Sampling (PARSS) builds and refines state abstractions as the search progresses [37]. However, these methods, including PARSS, generally assume a discrete action space, which is incompatible with our setup. State abstractions can also be used in a continuous setting when combined with progressive widening [65]. Our results indicate that relying on progressive widening makes the proposed method uncompetitive with CMCGS. Elastic MCTS uses state abstractions for strategy game playing, but lacks a progressive widening or discretization strategy to handle the high-dimensional continuous action spaces prevalent in our experiments [74].
There are also learning-based approaches to online planning that represent the policy using a neural network and sample from it when expanding the tree [2, 38]. By leveraging the learned policy, these methods achieve promising performance with limited search budgets. However, we focus only on the core search component and leave comparisons with learning-based methods to future work.
3 PRELIMINARIES
Planning and search is arguably the most classic approach for optimal control [19, 48, 69]. The idea is to start with a (rough) estimate of the action trajectory and gradually improve it through model-based simulation of the environment. In the case of environments with complex dynamics, this is usually done online by optimizing only a small part of the action trajectory at each step. Then, the first action of the optimized trajectory is executed. After that, the whole process is repeated, starting from the newly visited state. This process is also known as closed-loop control, or model-predictive control (MPC). MPC is an effective approach for optimal control in complex real-time environments [5, 21, 60].
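As an illustration, the receding-horizon loop described above can be sketched as follows; the `plan_trajectory` routine, the gym-style `env` interface, and the horizon length are illustrative placeholders rather than components of this work.

```python
def mpc_control(env, plan_trajectory, horizon=10, n_steps=100):
    """Minimal model-predictive control loop (illustrative sketch).

    plan_trajectory(state, horizon) is assumed to return a sequence of
    actions optimized through model-based simulation of the environment.
    """
    state = env.reset()
    for _ in range(n_steps):
        # Optimize only a short action trajectory from the current state.
        actions = plan_trajectory(state, horizon)
        # Execute the first action of the optimized trajectory ...
        state, reward, done, _ = env.step(actions[0])
        if done:
            break
        # ... then replan from the newly visited state (closed-loop control).
    return state
```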
Monte Carlo Tree Search (MCTS) [15] is one of the most popular MPC algorithms in environments such as video games [35, 53] that require principled exploration. MCTS starts by initializing a single-node search tree using the current state $s_t$ as the root and grows the tree in an iterative manner, where one node is added to the tree at each iteration. The iterative process of MCTS is comprised of the following four key steps [9]:
(1) Selection (Tree Policy): In this step, the algorithm selects one of the tree nodes to be expanded next. This is done by starting from the root and navigating to a child node until a node with at least one unexpanded child is visited, using a selection criterion whose goal is to balance the exploration-exploitation trade-off.
(2) Expansion: In this step, MCTS expands the search tree by adding a (randomly chosen) unvisited child to the selected node from the previous step.
(3) Simulation (Default Policy): After expanding the selected node, the newly added node is evaluated by rolling out a trajectory of random actions and computing the return.
(4) Backup: In the final step, the computed return is backed up through the navigated nodes, updating the statistics stored in each node.
For a more detailed description of the MCTS algorithm, the
reader is referred to Browne et al. [9].
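The four steps can be condensed into the following skeleton; the node attributes, `select_child`, and `rollout` are hypothetical helpers introduced only to make the control flow explicit and do not correspond to a specific implementation.

```python
def mcts_iteration(root, env_model, select_child, rollout):
    """One MCTS iteration (illustrative skeleton)."""
    # (1) Selection: descend from the root with the tree policy until a node
    #     with at least one unexpanded child is reached.
    node = root
    while node.fully_expanded() and node.children:
        node = select_child(node)

    # (2) Expansion: add a randomly chosen unvisited child to the tree.
    if not node.fully_expanded():
        node = node.expand_random_child(env_model)

    # (3) Simulation: evaluate the new node with a random-action rollout.
    ret = rollout(node.state, env_model)

    # (4) Backup: propagate the return through the navigated nodes.
    while node is not None:
        node.visits += 1
        node.value_sum += ret
        node = node.parent
```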
4 CONTINUOUS MONTE CARLO GRAPH
SEARCH
We propose an algorithm for continuous control called Continuous Monte Carlo Graph Search (CMCGS). We assume a discrete-time, discounted, and fully-observable Markov Decision Process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ with continuous states and actions. $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition probability function, $R$ the reward function, and $\gamma \in [0, 1]$ the discount factor. The objective of CMCGS is to find the action $a \in \mathcal{A}$ that maximizes the return in $M$ by performing closed-loop control with model-based simulation of the environment. CMCGS is an iterative online planning algorithm in which each iteration contains the same four steps as MCTS with an additional width expansion step (see Fig. 1). Instead of building a search tree, CMCGS performs a search on a layered graph in which each layer corresponds to one step of the underlying MDP. We assume that we have access to a dynamics model $p(\boldsymbol{s}_{t+1}, r_{t+1} \mid \boldsymbol{s}_t, \boldsymbol{a}_t)$ that can either be an exact model of the environment or a learned approximation.
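For concreteness, a minimal interface for such a dynamics model could look like the following sketch; the class name and method signature are assumptions made for illustration and not an API defined in this paper.

```python
from typing import Protocol, Tuple
import numpy as np

class DynamicsModel(Protocol):
    """Hypothetical interface for the model p(s_{t+1}, r_{t+1} | s_t, a_t)."""

    def step(self, state: np.ndarray, action: np.ndarray) -> Tuple[np.ndarray, float]:
        """Sample a next state and reward for the given state-action pair.

        The implementation may be an exact simulator of the environment
        or a learned approximation of its dynamics and reward.
        """
        ...
```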
The core idea of CMCGS is to use state clustering, Gaussian action bandits, and a layered directed search graph to tackle the challenges posed by search state explosion and high-dimensional continuous states and actions. Each node $q$ in the $t$-th layer of the CMCGS graph corresponds to a cluster of states visited at timestep $t$. The node is characterized by its state distribution $p_q(\boldsymbol{s}_t)$ and policy $\pi_q(\boldsymbol{a}_t)$, which describe the states assigned to this cluster and the preferred actions taken from this cluster during planning. Both distributions $p_q(\boldsymbol{s}_t)$ and $\pi_q(\boldsymbol{a}_t)$ are modeled as Gaussian distributions with diagonal covariance matrices. The policy $\pi_q(\boldsymbol{a}_t)$ is initialized as a Gaussian $\mathcal{N}\left(\mu = \tfrac{1}{2}(\boldsymbol{a}_{\min} + \boldsymbol{a}_{\max}),\ \sigma = \tfrac{1}{2}(\boldsymbol{a}_{\max} - \boldsymbol{a}_{\min})\right)$. During planning, the distributions are fit to the data in the replay buffer.
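A minimal sketch of such a node is given below, assuming diagonal Gaussians parameterized by mean and standard deviation and the replay buffer described next; the class and attribute names are illustrative only.

```python
import numpy as np

class GraphNode:
    """One CMCGS node: a cluster of states visited at a given timestep (sketch)."""

    def __init__(self, action_low, action_high, state_dim):
        # Policy pi_q(a_t): diagonal Gaussian initialized as
        # N(mu = (a_min + a_max) / 2, sigma = (a_max - a_min) / 2).
        self.policy_mean = 0.5 * (action_low + action_high)
        self.policy_std = 0.5 * (action_high - action_low)
        # State distribution p_q(s_t): diagonal Gaussian over the clustered states.
        self.state_mean = np.zeros(state_dim)
        self.state_std = np.ones(state_dim)

    def refit(self, states, actions):
        """Refit both diagonal Gaussians to this node's replay-buffer data D_q.

        The exact weighting or filtering of the data used in the paper is not
        reproduced here; this sketch uses plain empirical moments.
        """
        self.state_mean = states.mean(axis=0)
        self.state_std = states.std(axis=0) + 1e-6
        self.policy_mean = actions.mean(axis=0)
        self.policy_std = actions.std(axis=0) + 1e-6
```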
The replay buffer contains tuples $e = (\boldsymbol{s}_t, \boldsymbol{a}_t, \boldsymbol{s}_{t+1}, R, q)$, where $q$ is the node to which state $\boldsymbol{s}_t$ belongs and $R$ is the observed trajectory return. We denote by $\mathcal{D}_q$ all the tuples in the replay buffer that correspond to node $q$, and by $Q_t$ the set of nodes in layer $t$.
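The corresponding bookkeeping can be sketched as follows; the dataclass and the grouping helper are illustrative assumptions rather than the paper's data structures.

```python
from collections import defaultdict
from dataclasses import dataclass
import numpy as np

@dataclass
class Experience:
    """One replay-buffer tuple e = (s_t, a_t, s_{t+1}, R, q)."""
    state: np.ndarray
    action: np.ndarray
    next_state: np.ndarray
    trajectory_return: float  # the observed trajectory return R
    node_id: int              # the node q whose cluster contains s_t

def group_by_node(replay_buffer):
    """Return D_q for every node q, i.e., the tuples assigned to that node."""
    groups = defaultdict(list)
    for e in replay_buffer:
        groups[e.node_id].append(e)
    return groups
```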
The search graph is initialized with $d_{\text{init}}$ layers and one node per layer. Initializing the graph with multiple layers improves the early search efficiency. The CMCGS algorithm then iterates the following steps (see Fig. 1 for a graphical illustration).
(a) Selection: The selection mechanism is repeatedly applied to navigate through the search graph starting from the root node until a sink node (a node without children) or a terminal state of the MDP is reached (Fig. 1a). When navigating through the graph, we use an epsilon-greedy-inspired policy: with probability $\epsilon$, we sample an action from the node policy, the Gaussian distribution $\pi_q(\boldsymbol{a})$, and with probability $1 - \epsilon$, we uniformly sample one of the top actions in $\mathcal{D}_q$ (greedy action) and perturb it with small Gaussian noise for local improvement¹:

¹ This approach enables the agent to leverage past experience efficiently while exploring new options, leading to better performance on complex tasks.
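A sketch of this epsilon-greedy-inspired action selection at a single node is given below, reusing the node and buffer structures sketched earlier; the criterion for the top actions (here, ranking by trajectory return), the top-action fraction, and the noise scale are assumptions for illustration rather than the paper's hyperparameters.

```python
import numpy as np

def select_action(node, node_buffer, eps=0.5, top_frac=0.2, noise_scale=0.05, rng=None):
    """Epsilon-greedy-inspired action selection at one CMCGS node (sketch)."""
    rng = rng or np.random.default_rng()
    if not node_buffer or rng.random() < eps:
        # With probability eps: sample from the node policy pi_q(a),
        # a Gaussian with diagonal covariance.
        return rng.normal(node.policy_mean, node.policy_std)
    # With probability 1 - eps: uniformly sample one of the top actions in D_q
    # and perturb it with small Gaussian noise for local improvement.
    ranked = sorted(node_buffer, key=lambda e: e.trajectory_return, reverse=True)
    k = max(1, int(top_frac * len(ranked)))
    greedy = ranked[rng.integers(k)].action
    return greedy + rng.normal(0.0, noise_scale, size=greedy.shape)
```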