MULTI-TASK OPTION LEARNING AND DISCOVERY
FOR STOCHASTIC PATH PLANNING
Naman Shah
Arizona State University
Tempe, AZ, 85281
namanshah@asu.edu
Siddharth Srivastava
Arizona State University
Tempe, AZ, 85281
siddharths@asu.edu
ABSTRACT
This paper addresses the problem of reliably and efficiently solving broad classes
of long-horizon stochastic path planning problems. Starting with a vanilla RL
formulation with a stochastic dynamics simulator and an occupancy matrix of
the environment, our approach computes useful options with policies as well as
high-level paths that compose the discovered options.
Our main contributions are (1) data-driven methods for creating abstract states that
serve as end points for helpful options, (2) methods for computing option policies
using auto-generated option guides in the form of dense pseudo-reward functions,
and (3) an overarching algorithm for composing the computed options. We show
that this approach yields strong guarantees of executability and solvability: under
fairly general conditions, the computed option guides lead to composable option
policies and consequently ensure downward refinability. Empirical evaluation on
a range of robots, environments, and tasks shows that this approach effectively
transfers knowledge across related tasks and that it outperforms existing approaches
by a significant margin.
1 INTRODUCTION
Autonomous robots must compute long-horizon motion plans (or path plans) to accomplish their
tasks. Robots use controllers to execute these motion plans by reaching each point in the motion
plan. However, the physical dynamics can be noisy and controllers are not always able to achieve
precise trajectory targets. This prevents robots from deterministically reaching a goal while executing
the computed motion plan and increases the complexity of the motion planning problem. Several
approaches (Schaul et al., 2015; Pong et al., 2018) have used reinforcement learning (RL) to solve
multi-goal stochastic path planning problems by learning goal-conditioned reactive policies. However,
these approaches work only for short-horizon problems (Eysenbach et al., 2019). On the other hand,
multiple approaches have been designed for handling stochasticity in motion planning (Alterovitz
et al., 2007; Sun et al., 2016), but they require discrete actions for the robot. This paper addresses
the following question: Can we develop effective approaches that can efficiently compute plans for
long-horizon continuous stochastic path planning problems? In this paper, we show that we can
develop such an approach by learning abstract states and then learning options that serve as actions
between these abstract states.
Abstractions play an important role in long-horizon planning. Temporally abstracted high-level
actions reduce the horizon of the problem in order to reduce the complexity of the overall decision-
making problem. E.g., a task of reaching a location in a building can be solved using abstract
actions such as “go from room A to corridor A”, “reach elevator from corridor A”, etc., if one can
automatically identify these regions of saliency. Each of these actions is a temporally abstracted
action. Not only do these actions reduce the complexity of the problem, but they also allow the
transfer of knowledge across multiple tasks. E.g., if we learn how to reach room B from room A for a
task, we can reuse the same solution when this abstract action is required to solve some other task.
Reinforcement learning allows learning policies that account for the stochasticity of the environment.
Recent work (Lyu et al., 2019; Yang et al., 2018; Kokel et al., 2021) has shown that combining RL
with abstractions and symbolic planning has enabled robots to solve long-horizon problems that
require complex reasoning. However, these approaches require hand-coded abstractions. In this paper,
we show that automatically learned abstractions (Shah & Srivastava, 2022) can be efficiently
combined with deep reinforcement learning approaches.
The main contributions of this paper are: (1) A formal foundation for constructing a library of
two different types of options that are task-independent and transferable, (2) A novel approach
for auto-generating dense pseudo-reward functions in the form of option guides that can be used to
learn policies for synthesized options, and (3) An overall hierarchical algorithm that combines
these automatically synthesized abstract actions with reinforcement learning and uses them
for multi-task long-horizon continuous stochastic path planning problems. We also show that these
options are composable and can be used as abstract actions with high-level search algorithms.
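To make the overall pipeline concrete, the following is a rough, hypothetical sketch of how these pieces fit together. Every callable passed in below is a placeholder name for a component of the approach, not the paper's actual implementation.

```python
# Hypothetical end-to-end sketch of the approach described above. Each callable
# argument is a placeholder standing in for a component of the paper (abstraction
# learning, option-guide generation, RL training, high-level search), not the
# authors' implementation.

def solve_stochastic_path_planning(simulator, occupancy_matrix, start, goal,
                                   learn_abstract_states, neighboring_pairs,
                                   generate_option_guide, train_option_policy,
                                   high_level_search):
    # (1) data-driven abstraction: abstract states serve as option endpoints
    abstract_states = learn_abstract_states(occupancy_matrix)

    # (2) synthesize task-independent options between neighboring abstract states;
    #     each option policy is trained with RL using an auto-generated dense
    #     pseudo-reward (the "option guide")
    options = []
    for a, b in neighboring_pairs(abstract_states):
        guide = generate_option_guide(a, b)
        policy = train_option_policy(simulator, guide)
        options.append((a, b, policy))

    # (3) compose options: search over abstract states for a high-level path from
    #     start to goal, then return the option policies to execute in order
    high_level_plan = high_level_search(abstract_states, options, start, goal)
    return [policy for (_, _, policy) in high_level_plan]
```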
Our formal approach provides theoretical guarantees about the composability of the options and their
executability using an option guide. We present an extensive evaluation of our approach using two
separate sets of automatically synthesized options in a total of 14 settings to answer the following
critical questions: (1) Does this approach learn useful high-level planning representations? (2) Do
these learned representations support transfer of learning to new tasks?
The rest of the paper is organized as follows: Sec. 2 discusses some of the existing approaches that are closely
related to our approach; Sec. 3 introduces a few existing ideas used by our approach; Sec. 4 presents
our algorithm; Sec. 5 presents an extensive empirical evaluation of our approach.
2 RELATED WORK
To the best of our knowledge, this is the first approach that uses a data-driven approach for synthesizing
transferable and composable options and leverages these options with a hierarchical algorithm to
compute solutions for stochastic path planning problems. It builds upon the concepts of abstraction,
stochastic motion planning, option discovery, and hierarchical reinforcement learning and combines
reinforcement learning with planning. Here, we discuss related work from each of these areas.
Motion planning is a well-researched area. Numerous approaches (Kavraki et al., 1996; LaValle,
1998; Kuffner & LaValle, 2000; Pivtoraiko et al., 2009; Saxena et al., 2022) have been developed
for motion planning in deterministic environments. Kavraki et al. (1996); LaValle (1998); Kuffner
& LaValle (2000) develop sampling-based techniques that randomly sample configurations in the
environment and connect them for computing a motion plan from the initial and goal configurations.
Holte et al. (1996); Pivtoraiko et al. (2009); Saxena et al. (2022) discretize the configuration space
and use search techniques such as A* search to compute motion plans in the discrete space.
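As an illustration of this discretized-search family, a minimal A* search over a 4-connected occupancy grid might look like the sketch below (illustrative only; the grid representation is an assumption and the code is not taken from the cited implementations).

```python
# Minimal A* search on a 4-connected occupancy grid; grid[r][c] == 0 means free space.
import heapq

def astar(grid, start, goal):
    """Return a list of grid cells from start to goal, or None if no path exists."""
    def h(cell):  # Manhattan-distance heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start, None)]   # (f, g, cell, parent)
    parents, costs = {}, {start: 0}
    while frontier:
        _, g, cell, parent = heapq.heappop(frontier)
        if cell in parents:                   # already expanded with a better cost
            continue
        parents[cell] = parent
        if cell == goal:                      # reconstruct the path by walking parents
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                if g + 1 < costs.get((nr, nc), float("inf")):
                    costs[(nr, nc)] = g + 1
                    heapq.heappush(frontier, (g + 1 + h((nr, nc)), g + 1, (nr, nc), cell))
    return None
```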
Multiple approaches (Du et al., 2010; Kurniawati et al., 2012; Vitus et al., 2012; Berg et al., 2017;
Hibbard et al., 2022) have been developed for performing motion planning with stochastic dynamics.
Alterovitz et al. (2007) build a weighted graph called a stochastic motion roadmap (SMR), inspired by
probabilistic roadmaps (PRMs) (Kavraki et al., 1996), where the weights capture the probability of
the robot making the corresponding transition. Sun et al. (2016) use a linear quadratic regulator (a
linear controller that does not explicitly avoid collisions) along with value iteration to compute a
trajectory that maximizes the expected reward. However, these approaches require an analytical model
of the transition probability of the robot’s dynamics. Tamar et al. (2016) develop a fully differentiable
neural module that approximates value iteration and can be used to compute solutions for
stochastic path planning problems. However, these approaches (Alterovitz et al., 2007; Sun et al.,
2016; Tamar et al., 2016) require discretized actions. Du et al. (2010); Van Den Berg et al. (2012)
formulate a stochastic motion planning problem as a POMDP to capture the uncertainty in robot
sensing and movements. Multiple approaches (Jurgenson & Tamar, 2019; Eysenbach et al., 2019;
Jurgenson et al., 2020) design end-to-end reinforcement learning approaches for solving stochastic
motion planning problems. These approaches only learn policies to solve one path planning problem
at a time and do not transfer knowledge across multiple problems. In contrast, our approach does not
require discrete actions and learns options that are transferable to different problems.
Several approaches have considered the problem of learning task-specific subgoals. Kulkarni et al.
(2016); Bacon et al. (2017); Nachum et al. (2018; 2019); Czechowski et al. (2021) use intrinsic
reward functions to learn a two-level hierarchical policy. The high-level policy predicts a subgoal
that the low-level goal-conditioned policy should achieve. The high-level and low-level policies
are then trained simultaneously using simulations in the environment. Paul et al. (2019) combine
imitation learning with reinforcement learning to identify subgoals from expert trajectories and
bootstrap reinforcement learning. Levy et al. (2019) learn a multi-level policy where each level learns
subgoals for the policy at the level below it using Hindsight Experience Replay (HER) (Andrychowicz
et al., 2017); however, their approach targets control problems in deterministic settings rather than
long-horizon motion planning problems. Kim et al. (2021) randomly sample subgoals in the environment and use a path planning
algorithm to select the closest subgoal and learn a policy that achieves this subgoal.
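As a rough illustration of this two-level scheme, a rollout loop might look like the following sketch (the policy and environment interfaces are hypothetical placeholders, not code from any of the cited papers).

```python
# Rough sketch of a two-level subgoal scheme: a high-level policy proposes a
# subgoal every `horizon` steps, and a low-level goal-conditioned policy issues
# primitive actions toward the current subgoal. Interfaces are hypothetical.

def hierarchical_rollout(env, high_policy, low_policy, horizon=20, max_steps=500):
    """Roll out the two-level policy once and return the visited states."""
    state = env.reset()
    visited = [state]
    subgoal = None
    for t in range(max_steps):
        if t % horizon == 0:
            subgoal = high_policy(state)        # pick a new subgoal
        action = low_policy(state, subgoal)     # goal-conditioned primitive action
        state, reward, done, info = env.step(action)
        visited.append(state)
        if done:
            break
    return visited
```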
Numerous approaches (Stolle & Precup, 2002; Şimşek et al., 2005; Brunskill & Li, 2014; Kurutach
et al., 2018; Eysenbach et al., 2019; Bagaria & Konidaris, 2020; Bagaria et al., 2021) perform
hierarchical learning by identifying task-specific options through experience collected in the test
environment and then use these options (Sutton et al., 1999) along with low-level primitive actions.
Stolle & Precup (2002); Şimşek et al. (2005) lay the foundation for discovering options in discrete
settings. They collect trajectories in the environment and use them to identify high-frequency states in
the environment. These states are used as termination sets of the options and initiation sets are derived
by selecting states that lead to these high-frequency states. Once options are identified, they use
Q-learning to learn policies for these options independently to formulate Semi-MDPs (Sutton et al.,
1999). Bagaria & Konidaris (2020) learn options in a reverse fashion. They compute trajectories in
the environment that reach the goal state. In these trajectories, they use the last K points to define
an option: these points define the initiation set of the option, and the goal state is used as
the termination set. They continue to partition the rest of the collected trajectories similarly and generate a
fixed number (a hyperparameter) of options.
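A simplified sketch of this reverse, trajectory-tail partitioning idea is shown below (the data structures and bookkeeping are assumptions for illustration, not the cited authors' implementation).

```python
# Simplified sketch of trajectory-tail option discovery in the spirit of the
# paragraph above. Each option keeps an initiation set (states from which it
# may start) and a termination set (states it should reach).

def discover_options_backward(trajectories, k, num_options):
    """Carve `num_options` options off the tails of goal-reaching trajectories,
    using the last `k` states of each remaining segment per option."""
    segments = [list(t) for t in trajectories]       # working copies we can shrink
    options = []
    for _ in range(num_options):
        initiation, termination = [], []
        for seg in segments:
            if len(seg) <= k:                        # segment exhausted
                continue
            tail = seg[-k:]
            initiation.extend(tail[:-1])             # states that may start the option
            termination.append(tail[-1])             # state the option terminates in
            seg[:] = seg[:len(seg) - k + 1]          # keep the junction state for the next option
        options.append({"initiation": initiation, "termination": termination})
    return options
```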
Approaches that combine symbolic planning with reinforcement learning (Silver & Ciosek, 2012;
Yang et al., 2018; Jinnai et al., 2019; Lyu et al., 2019; Kokel et al., 2021) rely on pre-defined abstract
models. In contrast, our approach learns such options (including initiation and termination sets) as
well as their policies and uses them to compute solutions for stochastic path planning problems with
continuous state and action spaces.
Now, we discuss some of the existing concepts required to understand our approach.
3 BACKGROUND
Motion planning
Let X = X_free ∪ X_obs be the configuration space of a robot R and let O be the set of obstacles in a given environment. Given a collision function f : X → {0, 1}, X_free represents the set of configurations that are not in collision with any obstacle o ∈ O, i.e., f(x) = 0, and X_obs represents the set of configurations in collision, i.e., f(x) = 1. Let x_i ∈ X_free be the initial configuration of the robot and x_g ∈ X_free be the goal configuration of the robot. The motion planning problem can be defined as:
Definition 1. A motion planning problem M is defined as a 4-tuple ⟨X, f, x_i, x_g⟩, where X = X_free ∪ X_obs is the configuration space, f is the collision function, x_i is an initial configuration, and x_g is a goal configuration.
A solution to a motion planning problem is a motion plan τ. A motion plan is a sequence of configurations ⟨x_0, . . . , x_n⟩ such that x_0 = x_i, x_n = x_g, and ∀x ∈ τ, f(x) = 0. Robots use controllers that accept successive configurations from the motion plan and generate controls that take the robot from one configuration to the next. In practice, these environment dynamics are noisy, which introduces stochasticity in the overall motion planning problem. This stochasticity can be handled by computing a motion policy π : X → X that takes the current configuration of the robot and generates the next waypoint for the robot.
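For concreteness, the conditions on a motion plan could be checked as follows (an illustrative sketch; the representation of configurations and the collision function f are assumptions).

```python
# A minimal check of the motion planning definitions above: a motion plan must
# start at the initial configuration, end at the goal configuration, and be
# collision-free according to f (f(x) == 0 means collision-free).

def is_valid_motion_plan(plan, f, x_init, x_goal):
    """Check that `plan` is a motion plan for ⟨X, f, x_init, x_goal⟩."""
    return (
        len(plan) > 0
        and plan[0] == x_init
        and plan[-1] == x_goal
        and all(f(x) == 0 for x in plan)
    )
```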
Markov decision process
In this work, we define the stochastic path planning problem as a continuous stochastic shortest path (SSP) problem (Bertsekas & Tsitsiklis, 1991). A continuous stochastic shortest path problem is defined as a 6-tuple ⟨X, A, T, C, s_0, G⟩ where X is a continuous state space (the configuration space of the robot), A is a set of continuous actions, T : X × A × X → [0, 1] is a transition function, C : X → R is a cost function, s_0 is an initial state, and G ⊆ X is a set of goal states. The discount factor γ is set to 1 for this work. A solution to an SSP is a policy π : X → A that maps states to actions that take the robot to the goal states and minimizes the cumulative cost. Dynamic programming methods such as value iteration (VI) or policy iteration (PI) can be used to compute such policies when every component of the MDP is known. When one or more SSP components are unknown, various reinforcement learning (RL) approaches (Watkins, 1989; Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) can be used to learn policies.
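For concreteness, the SSP components above could be packaged and a policy evaluated under γ = 1 as in the following sketch (the interfaces are assumptions for illustration, not part of the paper).

```python
# Sketch of the SSP formulation above as a small container plus Monte Carlo
# evaluation of a policy's expected cumulative cost.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SSP:
    transition: Callable[[Any, Any], Any]   # samples s' given (s, a), i.e., from T
    cost: Callable[[Any], float]            # C(s)
    initial_state: Any                      # s_0
    is_goal: Callable[[Any], bool]          # membership test for G

def estimate_policy_cost(ssp: SSP, policy: Callable[[Any], Any],
                         episodes: int = 100, max_steps: int = 1000) -> float:
    """Average cumulative cost of following `policy` until a goal state,
    with discount factor γ = 1 as stated in the text."""
    totals = []
    for _ in range(episodes):
        state, total = ssp.initial_state, 0.0
        for _ in range(max_steps):
            if ssp.is_goal(state):
                break
            total += ssp.cost(state)
            state = ssp.transition(state, policy(state))
        totals.append(total)
    return sum(totals) / len(totals)
```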