
imitation learning with reinforcement learning for identifying subgoals from expert trajectories and
bootstrap reinforcement learning. Levy et al. (2019) learn a multi-level policy in which each level learns subgoals for the policy at the level below using Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), targeting control problems rather than long-horizon motion planning problems in deterministic settings. Kim et al. (2021) randomly sample subgoals in the environment, use a path planning algorithm to select the closest subgoal, and learn a policy that achieves this subgoal.
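As a rough illustration of this sample-and-select pattern (our reading of the idea, not Kim et al.'s implementation), the sketch below samples candidate subgoals and picks the one with the shortest planner-estimated path from the current configuration; sample_free_config and planner_path_length are hypothetical helpers standing in for a configuration sampler and a path planner.

```python
import numpy as np

def select_subgoal(current_config, sample_free_config, planner_path_length, n_candidates=50):
    """Sample candidate subgoals and return the one closest to the current
    configuration under the planner's path-length metric.

    sample_free_config() and planner_path_length(a, b) are assumed helpers:
    the former draws a random collision-free configuration, the latter returns
    an estimated path length (np.inf if no path is found).
    """
    candidates = [sample_free_config() for _ in range(n_candidates)]
    costs = np.array([planner_path_length(current_config, c) for c in candidates])
    return candidates[int(np.argmin(costs))]  # target for a goal-conditioned policy
```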
Numerous approaches (Stolle & Precup, 2002; Şimşek et al., 2005; Brunskill & Li, 2014; Kurutach et al., 2018; Eysenbach et al., 2019; Bagaria & Konidaris, 2020; Bagaria et al., 2021) perform hierarchical learning by identifying task-specific options through experience collected in the test environment and then using these options (Sutton et al., 1999) along with low-level primitive actions. Stolle & Precup (2002) and Şimşek et al. (2005) lay the foundation for discovering options in discrete settings. They collect trajectories in the environment and use them to identify high-frequency states. These states are used as termination sets of the options, and initiation sets are derived by selecting states that lead to these high-frequency states. Once options are identified, they use Q-learning to learn a policy for each option independently to formulate Semi-MDPs (Sutton et al., 1999). Bagaria & Konidaris (2020) learn options in a reverse fashion. They compute trajectories in the environment that reach the goal state and use the last $K$ points of each trajectory to define an option: these points form the initiation set of the option, and the goal state is used as the termination set. They continue to partition the rest of the collected trajectories similarly and generate a fixed number (a hyperparameter) of options.
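A schematic sketch of this reverse construction (a simplified reading, not the authors' implementation): working backward from the goal, the last $K$ points of the remaining trajectory approximate an option's initiation set, and the segment peeled off in the previous step serves as its termination set.

```python
def partition_trajectory(trajectory, K, num_options):
    """Partition a goal-reaching trajectory (a list of configurations) into
    options, working backward from the goal. Returns (initiation, termination)
    pairs of trajectory points; a simplified sketch, not an exact reproduction."""
    options = []
    remaining = list(trajectory)
    termination = [remaining[-1]]           # the goal state terminates the first option
    for _ in range(num_options):
        if len(remaining) <= 1:
            break
        initiation = remaining[-K - 1:-1]   # last K points before the termination region
        options.append((initiation, termination))
        remaining = remaining[:-K]          # peel off the segment covered by this option
        termination = initiation            # the next option should reach this initiation set
    return options
```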
Approaches that combine symbolic planning with reinforcement learning (Silver & Ciosek, 2012; Yang et al., 2018; Jinnai et al., 2019; Lyu et al., 2019; Kokel et al., 2021) rely on pre-defined abstract models. In contrast, our approach learns such options (including their initiation and termination sets) as well as their policies and uses them to compute solutions for stochastic path planning problems with continuous state and action spaces.
We now discuss the existing concepts required to understand our approach.
3 BACKGROUND
Motion planning
Let $X = X_{free} \cup X_{obs}$ be the configuration space of a robot $R$ and let $O$ be the set of obstacles in a given environment. Given a collision function $f : X \to \{0, 1\}$, $X_{free}$ represents the set of configurations that are not in collision with any obstacle $o \in O$, i.e., $f(x) = 0$, and $X_{obs}$ represents the set of configurations in collision, i.e., $f(x) = 1$. Let $x_i \in X_{free}$ be the initial configuration of the robot and $x_g \in X_{free}$ be the goal configuration of the robot. The motion planning problem can be defined as:
Definition 1. A motion planning problem $M$ is defined as a 4-tuple $\langle X, f, x_i, x_g \rangle$, where $X = X_{free} \cup X_{obs}$ is the configuration space, $f$ is the collision function, $x_i$ is an initial configuration, and $x_g$ is a goal configuration.
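To make the notation concrete, here is a minimal sketch of Definition 1 as a data structure (names are illustrative and not from the paper), with the configuration space represented implicitly by the collision function:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Config = Sequence[float]  # a configuration, e.g., joint angles or (x, y, theta)

@dataclass
class MotionPlanningProblem:
    """The 4-tuple <X, f, x_i, x_g>: X is represented implicitly by f,
    where f(x) == 0 means x is in X_free and f(x) == 1 means x is in X_obs."""
    collision_fn: Callable[[Config], int]  # f : X -> {0, 1}
    x_init: Config                         # x_i, initial configuration
    x_goal: Config                         # x_g, goal configuration
```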
A solution to a motion planning problem is a motion plan $\tau$. A motion plan is a sequence of configurations $\langle x_0, \dots, x_n \rangle$ such that $x_0 = x_i$, $x_n = x_g$, and $\forall x \in \tau,\, f(x) = 0$. Robots use a controller that accepts successive configurations from the motion plan and generates controls that take the robot from one configuration to the next. In practice, these environment dynamics are noisy, which introduces stochasticity into the overall motion planning problem. This stochasticity can be handled by computing a motion policy $\pi : X \to X$ that takes the current configuration of the robot and generates the next waypoint for the robot.
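The conditions on a motion plan translate directly into a validity check; the sketch below (illustrative, with exact equality standing in for a tolerance-based comparison) verifies that a candidate plan starts at $x_i$, ends at $x_g$, and is collision-free throughout.

```python
def is_valid_motion_plan(tau, collision_fn, x_init, x_goal):
    """Check that tau = <x_0, ..., x_n> satisfies x_0 = x_i, x_n = x_g,
    and f(x) = 0 for every configuration x in tau."""
    if len(tau) == 0:
        return False
    starts_at_init = tuple(tau[0]) == tuple(x_init)    # x_0 = x_i
    ends_at_goal = tuple(tau[-1]) == tuple(x_goal)     # x_n = x_g
    collision_free = all(collision_fn(x) == 0 for x in tau)
    return starts_at_init and ends_at_goal and collision_free
```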
Markov decision process
In this work, we define the stochastic path planning problem as a continuous stochastic shortest path (SSP) problem (Bertsekas & Tsitsiklis, 1991). A continuous stochastic shortest path problem is defined as a 6-tuple $\langle X, A, T, C, s_0, G \rangle$, where $X$ is a continuous state space (the configuration space of the robot), $A$ is a set of continuous actions, $T : X \times A \times X \to [0, 1]$ is a transition function, $C : X \to \mathbb{R}$ is a cost function, $s_0$ is an initial state, and $G \subseteq X$ is a set of goal states. The discount factor $\gamma$ is set to $1$ in this work. A solution to an SSP is a policy $\pi : X \to A$ that maps states to actions that take the robot to the goal states while minimizing the cumulative cost.
Dynamic programming methods such as value iteration (VI) or policy iteration (PI) can be used to compute such policies when every component of the MDP is known. When one or more SSP components are unknown, various reinforcement learning (RL) approaches (Watkins, 1989; Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) can be used to learn policies.
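For intuition, the sketch below runs value iteration on a small discretized SSP, a tabular stand-in for the continuous problem (the names and the discretization itself are illustrative assumptions): with $\gamma = 1$, goal states are absorbing with zero cost-to-go and the Bellman backup minimizes expected cumulative cost.

```python
import numpy as np

def value_iteration(T, C, goal_states, n_iters=500, tol=1e-6):
    """Tabular value iteration for a discretized SSP with discount factor 1.

    T has shape (S, A, S) with T[s, a, s'] the transition probability,
    C has shape (S,) with C[s] the cost of state s, and goal_states are
    treated as absorbing with zero cost-to-go.
    """
    num_states, num_actions, _ = T.shape
    V = np.zeros(num_states)
    for _ in range(n_iters):
        # Bellman backup: Q[s, a] = C[s] + sum_{s'} T[s, a, s'] * V[s']
        Q = C[:, None] + np.einsum("sap,p->sa", T, V)
        V_new = Q.min(axis=1)
        V_new[list(goal_states)] = 0.0  # goals incur no further cost
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = np.argmin(C[:, None] + np.einsum("sap,p->sa", T, V), axis=1)
    return V, policy
```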