
[Figure 2 panels: success rate vs. high-level interval (x-axis) and map size 12×12, 24×24, 56×56 (y-axis), for the previous HRL (HRAC) and ours (DHRL).]
Figure 2: Our method is scalable in large environments by breaking down the relations between the two levels and allowing both levels to operate at their suitable horizons.
We show that this practical limitation of HRL can be mitigated by breaking down the coupled horizons of HRL. In previous HRL frameworks, the horizons of the low level and the high level are structurally tied to each other because they share the total length of the episode. This relation causes a tradeoff between the training burdens of the two levels: if the interval between high-level actions increases (x-axis in Figure 2), the low-level policy has to cover a wider range, and in the opposite case, the high-level policy alone bears the extended burden in large environments (y-axis in Figure 2). This is why previous HRL algorithms cannot cope with the extended components of large environments (see Table 1 for the performance of a previous HRL method at various intervals).
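To make this coupling explicit, write $H$ for the episode length and $c_h$ for the interval between high-level actions; the decomposition below is our own illustration of the argument, not a result from the experiments:
$$
\underbrace{\lceil H / c_h \rceil}_{\text{high-level decisions}} \;\times\; \underbrace{c_h}_{\text{low-level steps per subgoal}} \;\approx\; H .
$$
Because the product is fixed by the episode length, enlarging $c_h$ to relieve the high level directly widens the region each low-level rollout must cover, and shrinking $c_h$ does the opposite.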
We break down the coupled horizons of HRL: to break the relation between the horizons of the two levels, we adopt a graph structure. In our method, the high-level policy can use a longer temporal abstraction while the low-level policy takes charge of only a small coverage, because the subgoal is decomposed into several waypoints with a graph. In this way, the HRL algorithm gains the ability to stretch the interval of high-level actions freely and, with this strengthened temporal abstraction, achieves complex and large tasks.
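To illustrate the idea of bridging the two horizons with a graph, the sketch below decomposes a distant subgoal into a chain of nearby waypoints by running a shortest-path query over landmark states. It is only a structural sketch: the landmark set, the `edge_cost` estimate, and the `max_edge_cost` threshold are assumptions of the example, not the exact construction used in DHRL.

```python
# Illustrative sketch: decompose a distant high-level subgoal into nearby
# waypoints via a graph, so the low-level policy only needs short-range skills.
# The graph construction and edge costs here are placeholder assumptions.
import networkx as nx
import numpy as np

def build_goal_graph(landmarks, edge_cost, max_edge_cost):
    """Connect landmark states whose estimated traversal cost is small."""
    g = nx.Graph()
    g.add_nodes_from(range(len(landmarks)))
    for i in range(len(landmarks)):
        for j in range(i + 1, len(landmarks)):
            c = edge_cost(landmarks[i], landmarks[j])
            if c <= max_edge_cost:          # keep only edges the low level can realize
                g.add_edge(i, j, weight=c)
    return g

def subgoal_to_waypoints(graph, landmarks, start_idx, subgoal_idx):
    """Return the sequence of waypoints bridging the current state and the subgoal."""
    path = nx.shortest_path(graph, start_idx, subgoal_idx, weight="weight")
    return [landmarks[i] for i in path]

# Toy usage with Euclidean distance as a stand-in cost estimate.
landmarks = [np.array([0., 0.]), np.array([1., 0.]), np.array([2., 0.]), np.array([3., 0.])]
graph = build_goal_graph(landmarks, lambda a, b: float(np.linalg.norm(a - b)), max_edge_cost=1.5)
print(subgoal_to_waypoints(graph, landmarks, start_idx=0, subgoal_idx=3))
```

In practice, the edge costs should reflect how easily the low-level policy can travel between two landmarks (e.g., an estimated temporal distance) rather than raw Euclidean distance; the point of the sketch is only that the low level never has to cover more than one edge at a time.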
In summary, our main contributions are:
• We show that the previous HRL structures are not scalable in large environments, and that this limitation can be mitigated by removing the coupled traits of the high level and the low level.
• To break down the coupled traits of HRL, we propose DHRL, which decouples the horizons of the high-level and low-level policies and bridges the gap using a graph.
• Our algorithm outperforms state-of-the-art algorithms in typical HRL environments and achieves complex and long tasks.
2 Preliminaries
We consider a finite-horizon Universal Markov Decision Process (UMDP), represented as a tuple $(\mathcal{S}, \mathcal{G}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, where $\mathcal{S}$, $\mathcal{G}$, and $\mathcal{A}$ are the state space, goal space, and action space, respectively. The environment is defined by the transition distribution $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ and the reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \rightarrow \mathbb{R}$, where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ are the state and action at timestep $t$, respectively. The total return of a trajectory $\tau = (s_0, a_0, \ldots, s_H, a_H)$ is $R(\tau, g) = \sum_{t=0}^{H-1} \gamma^t r(s_{t+1}, g)$, where $r(s_{t+1}, g)$ (or $r(s_t, a_t, g)$) is a goal-conditioned reward function and $\gamma$ is a discount factor. The subgoal $sg$, the waypoint $wp$, and the goal $g$ are defined in the goal space $\mathcal{G}$, and we consider $\mathcal{G}$ to be a subspace of $\mathcal{S}$ with a mapping $\psi: \mathcal{S} \rightarrow \mathcal{G}$.
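As a concrete (and simplified) instance of these definitions, the sketch below implements a sparse goal-conditioned reward and the discounted return $R(\tau, g)$; the projection $\psi$, the success threshold, and the $0/{-1}$ reward values are assumptions for illustration, not the actual reward of any particular environment.

```python
# Minimal sketch of a goal-conditioned sparse reward r(s_{t+1}, g) and the
# discounted return R(tau, g) = sum_t gamma^t r(s_{t+1}, g).
# The mapping psi and the success threshold are assumptions for illustration.
import numpy as np

def psi(state):
    """Map a state to goal space G (here: the first two coordinates)."""
    return state[:2]

def reward(next_state, goal, threshold=0.5):
    """Sparse goal-conditioned reward: 0 on success, -1 otherwise."""
    return 0.0 if np.linalg.norm(psi(next_state) - goal) < threshold else -1.0

def discounted_return(trajectory, goal, gamma=0.99):
    """R(tau, g) for tau given as a list of (s_t, a_t) pairs."""
    states = [s for (s, a) in trajectory]
    return sum(gamma ** t * reward(states[t + 1], goal) for t in range(len(states) - 1))

# Toy usage: a short trajectory in a 3-D state space with a 2-D goal space.
traj = [(np.array([0.0, 0.0, 0.0]), 0), (np.array([0.5, 0.0, 0.0]), 1), (np.array([1.0, 0.0, 0.0]), 0)]
print(discounted_return(traj, goal=np.array([2.0, 0.0])))   # -1.99: neither step reaches the goal
```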
An HRL framework typically has a high-level policy $\pi^{hi}$ and a low-level policy $\pi^{lo}$, each maintaining a separate replay buffer, $B^{hi}$ and $B^{lo}$. Our method also follows the general HRL framework and employs both buffers, which store high-level transition data $(s_t, g_t, sg_t, r_t, s_{t+c_h}) \in B^{hi}$ and low-level transition data $(s_t, wp_t, a_t, r_t, s_{t+1}) \in B^{lo}$, where $c_h$ is the interval between high-level actions.
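A minimal sketch of the two buffers and the transition tuples they store is given below; the field names mirror the notation above, while the fixed capacity and uniform sampling are simplifying assumptions.

```python
# Sketch of the separate high-level and low-level replay buffers.
# Tuple layouts mirror (s_t, g_t, sg_t, r_t, s_{t+c_h}) and (s_t, wp_t, a_t, r_t, s_{t+1});
# the fixed capacity and uniform sampling are simplifying assumptions.
import random
from collections import deque, namedtuple

HighTransition = namedtuple("HighTransition", "state goal subgoal reward next_state")
LowTransition = namedtuple("LowTransition", "state waypoint action reward next_state")

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

buffer_hi = ReplayBuffer()   # stores HighTransition tuples every c_h steps
buffer_lo = ReplayBuffer()   # stores LowTransition tuples every environment step
```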
One of the key problems in training HRL is that the low-level policy $\pi^{lo}$ is non-stationary, so old data from past policies may contain different next states $s_{t+c_h}$ even though the high-level policy provides the same subgoal in the identical state. To bypass this off-policy discrepancy, some HRL models use off-policy correction, which relabels the old subgoal of the high-level policy as the most ‘plausible subgoal’ that would make the current low-level agent produce a transition similar to the old data [17, 28, 12] (HIRO-style off-policy correction). Other approaches propose to relabel the high-level action $sg_t$ as the achieved state $s_{t+c_h}$ (HAC-style hindsight action relabelling), reducing the computation cost of finding a ‘plausible subgoal’ [15].
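As a sketch of the HAC-style relabelling described above, the stored high-level action is simply replaced by the state actually reached after $c_h$ steps, projected into goal space; the tuple layout and projection mirror the earlier sketches and are likewise assumptions, not the full relabelling pipeline of either method.

```python
# Sketch of HAC-style hindsight action relabelling: the stored high-level action
# sg_t is replaced by the state actually achieved after c_h steps (projected into
# goal space), so old transitions stay consistent with the current low level.
# The tuple layout and psi mirror the earlier sketches and are assumptions.
from collections import namedtuple
import numpy as np

HighTransition = namedtuple("HighTransition", "state goal subgoal reward next_state")

def psi(state):
    """Projection into goal space (here: the first two coordinates)."""
    return np.asarray(state)[:2]

def hindsight_relabel_action(transition):
    """Replace sg_t with the achieved state psi(s_{t+c_h})."""
    return transition._replace(subgoal=psi(transition.next_state))
```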