
imitation learning with reinforcement learning for identifying subgoals from expert trajectories and
bootstrap reinforcement learning. Levy et al. (2019) learn a multi-level policy in which each level learns subgoals for the policy at the level below using Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), targeting control problems rather than long-horizon motion planning problems in deterministic settings. Kim et al. (2021) randomly sample subgoals in the environment, use a path planning algorithm to select the closest subgoal, and learn a policy that achieves this subgoal.
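As a rough illustration of this sample-and-select pattern (our reading of the idea, not Kim et al.'s implementation), the sketch below samples candidate subgoals and picks the one with the shortest planner-estimated path from the current configuration; sample_free_config and planner_path_length are hypothetical helpers standing in for a configuration sampler and a path planner.

```python
import numpy as np

def select_subgoal(current_config, sample_free_config, planner_path_length, n_candidates=50):
    """Sample candidate subgoals and return the one closest to the current
    configuration under the planner's path-length metric.

    sample_free_config() and planner_path_length(a, b) are assumed helpers:
    the former draws a random collision-free configuration, the latter returns
    an estimated path length (np.inf if no path is found).
    """
    candidates = [sample_free_config() for _ in range(n_candidates)]
    costs = np.array([planner_path_length(current_config, c) for c in candidates])
    return candidates[int(np.argmin(costs))]  # target for a goal-conditioned policy
```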
Numerous approaches (Stolle & Precup, 2002; Şimşek et al., 2005; Brunskill & Li, 2014; Kurutach et al., 2018; Eysenbach et al., 2019; Bagaria & Konidaris, 2020; Bagaria et al., 2021) perform hierarchical learning by identifying task-specific options through experience collected in the test environment and then using these options (Sutton et al., 1999) along with low-level primitive actions. Stolle & Precup (2002) and Şimşek et al. (2005) lay the foundation for discovering options in discrete settings. They collect trajectories in the environment and use them to identify high-frequency states. These states are used as termination sets of the options, and initiation sets are derived by selecting states that lead to these high-frequency states. Once options are identified, they use Q-learning to learn a policy for each option independently to formulate Semi-MDPs (Sutton et al., 1999). Bagaria & Konidaris (2020) learn options in a reverse fashion. They compute trajectories in the environment that reach the goal state and use the last $K$ points of each trajectory to define an option: these points form the initiation set of the option, and the goal state is used as the termination set. They continue to partition the rest of the collected trajectories similarly and generate a fixed number (a hyperparameter) of options.
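A schematic sketch of this reverse construction (a simplified reading, not the authors' implementation): working backward from the goal, the last $K$ points of the remaining trajectory approximate an option's initiation set, and the segment peeled off in the previous step serves as its termination set.

```python
def partition_trajectory(trajectory, K, num_options):
    """Partition a goal-reaching trajectory (a list of configurations) into
    options, working backward from the goal. Returns (initiation, termination)
    pairs of trajectory points; a simplified sketch, not an exact reproduction."""
    options = []
    remaining = list(trajectory)
    termination = [remaining[-1]]           # the goal state terminates the first option
    for _ in range(num_options):
        if len(remaining) <= 1:
            break
        initiation = remaining[-K - 1:-1]   # last K points before the termination region
        options.append((initiation, termination))
        remaining = remaining[:-K]          # peel off the segment covered by this option
        termination = initiation            # the next option should reach this initiation set
    return options
```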
Approaches that combine symbolic planning with reinforcement learning (Silver & Ciosek, 2012; Yang et al., 2018; Jinnai et al., 2019; Lyu et al., 2019; Kokel et al., 2021) rely on pre-defined abstract models. In contrast, our approach learns such options (including their initiation and termination sets) as well as their policies and uses them to compute solutions for stochastic path planning problems with continuous state and action spaces.
We now discuss the existing concepts required to understand our approach.
3 BACKGROUND
Motion planning
Let $X = X_{free} \cup X_{obs}$ be the configuration space of a robot $R$ and let $O$ be the set of obstacles in a given environment. Given a collision function $f : X \to \{0, 1\}$, $X_{free}$ represents the set of configurations that are not in collision with any obstacle $o \in O$, i.e., $f(x) = 0$, and $X_{obs}$ represents the set of configurations in collision, i.e., $f(x) = 1$. Let $x_i \in X_{free}$ be the initial configuration of the robot and $x_g \in X_{free}$ be the goal configuration of the robot. The motion planning problem can be defined as:
Definition 1. A motion planning problem $M$ is defined as a 4-tuple $\langle X, f, x_i, x_g \rangle$, where $X = X_{free} \cup X_{obs}$ is the configuration space, $f$ is the collision function, $x_i$ is an initial configuration, and $x_g$ is a goal configuration.
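To make the notation concrete, here is a minimal sketch of Definition 1 as a data structure (names are illustrative and not from the paper), with the configuration space represented implicitly by the collision function:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Config = Sequence[float]  # a configuration, e.g., joint angles or (x, y, theta)

@dataclass
class MotionPlanningProblem:
    """The 4-tuple <X, f, x_i, x_g>: X is represented implicitly by f,
    where f(x) == 0 means x is in X_free and f(x) == 1 means x is in X_obs."""
    collision_fn: Callable[[Config], int]  # f : X -> {0, 1}
    x_init: Config                         # x_i, initial configuration
    x_goal: Config                         # x_g, goal configuration
```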
A solution to a motion planning problem is a motion plan $\tau$. A motion plan is a sequence of configurations $\langle x_0, \dots, x_n \rangle$ such that $x_0 = x_i$, $x_n = x_g$, and $\forall x \in \tau,\, f(x) = 0$. Robots use a controller that accepts successive configurations from the motion plan and generates controls that take the robot from one configuration to the next. In practice, these environment dynamics are noisy, which introduces stochasticity into the overall motion planning problem. This stochasticity can be handled by computing a motion policy $\pi : X \to X$ that takes the current configuration of the robot and generates the next waypoint for the robot.
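The conditions on a motion plan translate directly into a validity check; the sketch below (illustrative, with exact equality standing in for a tolerance-based comparison) verifies that a candidate plan starts at $x_i$, ends at $x_g$, and is collision-free throughout.

```python
def is_valid_motion_plan(tau, collision_fn, x_init, x_goal):
    """Check that tau = <x_0, ..., x_n> satisfies x_0 = x_i, x_n = x_g,
    and f(x) = 0 for every configuration x in tau."""
    if len(tau) == 0:
        return False
    starts_at_init = tuple(tau[0]) == tuple(x_init)    # x_0 = x_i
    ends_at_goal = tuple(tau[-1]) == tuple(x_goal)     # x_n = x_g
    collision_free = all(collision_fn(x) == 0 for x in tau)
    return starts_at_init and ends_at_goal and collision_free
```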
Markov decision process
In this work, we define the stochastic path planning problem as a continuous stochastic shortest path (SSP) problem (Bertsekas & Tsitsiklis, 1991). A continuous stochastic shortest path problem is defined as a 6-tuple $\langle X, A, T, C, s_0, G \rangle$, where $X$ is a continuous state space (the configuration space of the robot), $A$ is a set of continuous actions, $T : X \times A \times X \to [0, 1]$ is a transition function, $C : X \to \mathbb{R}$ is a cost function, $s_0$ is an initial state, and $G \subseteq X$ is a set of goal states. The discount factor $\gamma$ is set to $1$ in this work. A solution to an SSP is a policy $\pi : X \to A$ that maps states to actions that take the robot to the goal states while minimizing the cumulative cost.
Dynamic programming methods such as value iteration (VI) or policy iteration (PI) can be used to compute such policies when every component of the MDP is known. When one or more SSP components are unknown, various reinforcement learning (RL) approaches (Watkins, 1989; Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) can be used to learn policies.
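For intuition, the sketch below runs value iteration on a small discretized SSP, a tabular stand-in for the continuous problem (the names and the discretization itself are illustrative assumptions): with $\gamma = 1$, goal states are absorbing with zero cost-to-go and the Bellman backup minimizes expected cumulative cost.

```python
import numpy as np

def value_iteration(T, C, goal_states, n_iters=500, tol=1e-6):
    """Tabular value iteration for a discretized SSP with discount factor 1.

    T has shape (S, A, S) with T[s, a, s'] the transition probability,
    C has shape (S,) with C[s] the cost of state s, and goal_states are
    treated as absorbing with zero cost-to-go.
    """
    num_states, num_actions, _ = T.shape
    V = np.zeros(num_states)
    for _ in range(n_iters):
        # Bellman backup: Q[s, a] = C[s] + sum_{s'} T[s, a, s'] * V[s']
        Q = C[:, None] + np.einsum("sap,p->sa", T, V)
        V_new = Q.min(axis=1)
        V_new[list(goal_states)] = 0.0  # goals incur no further cost
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = np.argmin(C[:, None] + np.einsum("sap,p->sa", T, V), axis=1)
    return V, policy
```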