DHRL: A Graph-Based Approach for Long-Horizon
and Sparse Hierarchical Reinforcement Learning
Seungjae Lee1,2, Jigang Kim1,2, Inkyu Jang1,3, H. Jin Kim1,3
1Seoul National University
2Artificial Intelligence Institute of Seoul National University (AIIS)
3Automation and Systems Research Institute (ASRI)
{ysz0301, jgkim2020, leplusbon, hjinkim}@snu.ac.kr
Abstract

Hierarchical Reinforcement Learning (HRL) has made notable progress in complex control tasks by leveraging temporal abstraction. However, previous HRL algorithms often suffer from serious data inefficiency as environments get large. The extended components, i.e., goal space and length of episodes, impose a burden on either one or both high-level and low-level policies since both levels share the total horizon of the episode. In this paper, we present a method of Decoupling Horizons Using a Graph in Hierarchical Reinforcement Learning (DHRL) which can alleviate this problem by decoupling the horizons of high-level and low-level policies and bridging the gap between the length of both horizons using a graph. DHRL provides a freely stretchable high-level action interval, which facilitates longer temporal abstraction and faster training in complex tasks. Our method outperforms state-of-the-art HRL algorithms in typical HRL environments. Moreover, DHRL achieves long and complex locomotion and manipulation tasks.
1 Introduction
Figure 1: DHRL: By decoupling the horizons of both levels of the hierarchical network, DHRL not only solves long and sparse tasks but also significantly outperforms previous state-of-the-art algorithms. (The figure contrasts the high-level horizon and the low-level horizon in the AntMazeComplex environment.)
Reinforcement Learning (RL) has been successfully applied to a range of robot systems, such as locomotion tasks [24, 8], learning to control aerial robots [10, 13], and robot manipulation [14, 22]. Goal-conditioned RL, which augments the state with a goal to train an agent for various goals [23, 19], further raised the applicability of RL in robot systems by allowing the agent to achieve diverse tasks.

Hierarchical Reinforcement Learning (HRL), which trains multiple levels of goal-conditioned RL, has improved the performance of RL in complex and sparse tasks with long horizons using temporally extended policies [25, 26, 16]. On the back of these strengths, HRL has been adopted to solve various complex robotics tasks [20, 11, 18].
However, HRL often has difficulty in complex or large environments because of training inefficiency. Previous studies speculated that the cause of this problem is the large goal space, and restricted the high-level action space to alleviate this phenomenon [28, 12]. Nevertheless, this approach performs well only in environments of limited length and complexity and still suffers from the same trouble in larger environments.
Figure 2: Our method is scalable in large environments by breaking down the relations between the two levels and allowing both levels to operate at their suitable horizons. (The figure reports success rates of a previous HRL method (HRAC) and ours (DHRL) as heatmaps over high-level intervals of 10, 30, and 50 and map sizes of 12x12, 24x24, and 56x56.)
We show that this practical limitation of HRL can be mitigated by breaking down the coupled horizons of HRL. In previous HRL frameworks, the horizons of the low level and high level are structurally related to each other because they share the total length of the episode. This relation causes a tradeoff between the training burdens of the two levels: if the interval between high-level actions increases (x-axis in Figure 2), the low-level policy has to cover a wider range, and in the opposite case, the high-level policy takes charge of the extended burden alone in large environments (y-axis in Figure 2). This is why the previous HRL algorithms cannot cope with the extended components of large environments (see Table 1 for the performance of a previous HRL method at various intervals).
We break down the coupled horizons of HRL: To break the relation between the horizons of both levels, we adopt a graph structure. In our method, the high-level policy can use a longer temporal abstraction while the low-level policy only takes charge of a smaller coverage, since the graph decomposes the subgoal into several waypoints. In this way, the HRL algorithm gains the capability to stretch the interval of high-level actions freely and achieves complex and large tasks, amplifying the strengths of HRL.
In summary, our main contributions are:

- We show that the previous HRL structures are not scalable in large environments, and that this limitation can be mitigated by removing the coupled traits of the high level and low level.
- To break down the coupled traits of HRL, we propose DHRL, which decouples the horizons of high-level and low-level policies and bridges the gap using a graph.
- Our algorithm outperforms state-of-the-art algorithms in typical HRL environments and achieves complex and long tasks.
2 Preliminaries
We consider a finite-horizon Universal Markov Decision Process (UMDP), which can be represented as a tuple $(\mathcal{S}, \mathcal{G}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, where $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{G}$ are the state space, action space, and goal space, respectively. The environment is defined by the transition distribution $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ and reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \rightarrow \mathbb{R}$, where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ are the state and action at timestep $t$, respectively. The total return of a trajectory $\tau = (s_0, a_0, \ldots, s_H, a_H)$ is $\mathcal{R}(\tau, g) = \sum_{t=0}^{H-1} \gamma^t r(s_{t+1}, g)$, where $r(s_{t+1}, g)$ (or $r(s_t, a_t, g)$) is a goal-conditioned reward function and $\gamma$ is a discount factor. The subgoal $sg$, waypoint $wp$, and goal $g$ are defined in the goal space $\mathcal{G}$, and we consider a goal space $\mathcal{G}$ that is a subspace of $\mathcal{S}$ with a mapping $\psi: \mathcal{S} \rightarrow \mathcal{G}$.
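To make the notation concrete, the minimal sketch below instantiates a goal-conditioned reward and the return $\mathcal{R}(\tau, g)$ under the sparse -1/0 reward used later in this section; the projection $\psi$ (taken here as the first two state dimensions) and the goal-reach threshold are illustrative assumptions, not specifications from the paper.

```python
import numpy as np

def psi(state: np.ndarray) -> np.ndarray:
    """Hypothetical projection psi: S -> G; here we assume the first two
    state dimensions are the x-y position used as the goal representation."""
    return state[:2]

def sparse_reward(next_state: np.ndarray, goal: np.ndarray, threshold: float = 0.5) -> float:
    """Goal-conditioned sparse reward: 0 inside the goal region, -1 elsewhere."""
    return 0.0 if np.linalg.norm(psi(next_state) - goal) <= threshold else -1.0

def trajectory_return(states: list, goal: np.ndarray, gamma: float = 0.99) -> float:
    """Discounted return R(tau, g) = sum_{t=0}^{H-1} gamma^t * r(s_{t+1}, g)."""
    return sum(gamma ** t * sparse_reward(states[t + 1], goal)
               for t in range(len(states) - 1))
```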
An HRL framework typically has a high-level policy $\pi^{hi}$ and a low-level policy $\pi^{lo}$, each maintaining a separate replay buffer, $\mathcal{B}^{hi}$ and $\mathcal{B}^{lo}$. Our method also follows the general HRL framework and employs both buffers, which store high-level transitions $(s_t, g_t, sg_t, r_t, s_{t+c_h}) \in \mathcal{B}^{hi}$ and low-level transitions $(s_t, wp_t, a_t, r_t, s_{t+1}) \in \mathcal{B}^{lo}$, where $c_h$ is the interval between high-level actions.
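As a minimal sketch (class and method names are ours, not the authors'), the two buffers can simply be kept as independent FIFO stores of exactly these tuples:

```python
from collections import deque

class HRLBuffers:
    """Two separate replay buffers, one per level, holding the tuples above."""

    def __init__(self, capacity: int = 100_000):
        self.high = deque(maxlen=capacity)  # (s_t, g_t, sg_t, r_t, s_{t+c_h})
        self.low = deque(maxlen=capacity)   # (s_t, wp_t, a_t, r_t, s_{t+1})

    def store_high(self, s, g, sg, r, s_next_ch):
        self.high.append((s, g, sg, r, s_next_ch))

    def store_low(self, s, wp, a, r, s_next):
        self.low.append((s, wp, a, r, s_next))
```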
One of the key problems in learning HRL is that the low-level policy $\pi^{lo}$ is non-stationary, and thus old data from past policies may contain different next-states $s_{t+c_h}$ even though the high-level policy provides the same subgoal in the identical state. To bypass this off-policy discrepancy, HRL models use off-policy correction, which relabels the old subgoal of the high-level policy as the most 'plausible subgoal', i.e., the one under which the current low-level agent would produce a transition similar to the old data [17, 28, 12] (HIRO-style off-policy correction). Other approaches propose to relabel the high-level action $sg_t$ as the achieved state $s_{t+c_h}$ (HAC-style hindsight action relabelling), reducing the computation cost of finding a 'plausible subgoal' [15].
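The sketch below illustrates HAC-style hindsight action relabelling as described above: the stored subgoal is replaced with the goal-space projection of the state actually reached $c_h$ steps later. The function name and tuple layout are ours, matching the buffer format above.

```python
def hindsight_relabel_high_transition(transition, psi):
    """HAC-style relabelling: replace the stored subgoal with the achieved state.

    transition: (s_t, g_t, sg_t, r_t, s_{t+c_h}) as stored in the high-level buffer.
    psi: projection from state space S to goal space G.
    """
    s_t, g_t, sg_t, r_t, s_t_ch = transition
    relabeled_sg = psi(s_t_ch)  # the subgoal the low level actually achieved
    return (s_t, g_t, relabeled_sg, r_t, s_t_ch)
```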
Graph-guided RL methods, which combine the strengths of RL and planning by decomposing a long-horizon task into multi-step sub-problems, estimate the temporal distance between states and goals to construct a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ on the goal space $\mathcal{G}$ without additional prior knowledge about the environment. Previous studies proposed various methods to recover this distance from the Q-value [4, 9, 27]. If the agent gets a reward of -1 at every step except when it is in the goal area, where it gets a reward of 0, then $Q^{lo}(s, a \mid g)$ can reveal the temporal distance from $s$ to $g$ as (refer to Appendix B for the detailed derivation):

$$Dist(s \rightarrow g) = \log_{\gamma}\big(1 + (1 - \gamma)\, Q^{lo}(s, \pi(s, g) \mid g)\big) \tag{1}$$

However, it is known that correctly recovering the temporal distance from a vanilla Q-network in this setting is challenging. For that reason, previous methods use an additional value function approximator [27] or distributional Q-networks [4].
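Under the -1/0 sparse reward above, Eq. (1) can be evaluated directly from a low-level critic. The sketch below assumes hypothetical `q_lo(state, action, goal)` and `policy_lo(state, goal)` callables and clips the Q-value to the range where the logarithm is defined:

```python
import math

def temporal_distance(q_lo, policy_lo, state, goal, gamma: float = 0.99) -> float:
    """Recover Dist(s -> g) from a -1/0 sparse-reward Q-value via Eq. (1):
    Dist = log_gamma(1 + (1 - gamma) * Q_lo(s, pi(s, g) | g))."""
    q = q_lo(state, policy_lo(state, goal), goal)
    # A well-behaved Q lies in [-1 / (1 - gamma), 0]; clip to keep the log defined
    # and to avoid negative distances from overestimated Q-values.
    q = min(max(q, -1.0 / (1.0 - gamma) + 1e-6), 0.0)
    return math.log(1.0 + (1.0 - gamma) * q) / math.log(gamma)
```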
3 Related work
Figure 3: The differences between DHRL and the previous graph-based HRL methods. Our algorithm includes the graph structure between both levels explicitly, while the previous methods use the graph only for training the high-level policy or for getting waypoints. (Panels compare Ours (DHRL), HIGL, and graph-guided RL, showing goal, subgoal, state, low-level coverage, shortest path, and graph edges, and which components (the high-level policy, the low-level policy, and the graph) each method trains.)
Graph-guided RL. Graphs have recently been used as a non-parametric model in reinforcement learning (RL) to combine the advantages of RL and planning [4, 9, 27, 3, 6]. By decomposing a long-horizon task into multi-step planning problems, these studies have shown better performance and data efficiency. Search on the Replay Buffer (SORB) [4] constructs a directed graph based on states randomly extracted from a replay buffer and Q-function-based edge cost estimation. Follow-up studies further improved the performance of the earlier graph-guided RL methods [4, 9] by combining additional techniques such as graph search in a latent space [27] or model predictive control [3].
However, previous papers sidestepped the exploration problem in complex tasks through a 'uniform initial state distribution' [4, 9, 27, 6], or work only in dense reward settings [3]. We emphasize that the 'uniform initial state distribution' accesses privileged information about the environment during training by spawning the agent uniformly within the feasible area of the map. This greatly reduces the scope of application of these algorithms. Unlike prior methods, ours can train from sparse reward settings and a 'fixed initial state distribution' without knowledge of the agent's surroundings, which makes it practical for physical settings. For detailed examples and a comparison of various initial state distributions, see Table 2 and Figure 11 in Appendix C.
Constrained-subgoal HRL. To mitigate the training inefficiency of HRL, several researchers proposed methods that restrict the action of the high-level agent to lie in adjacent areas. Hierarchical Reinforcement Learning with k-step Adjacency Constraint (HRAC) [28] limits the subgoal to the adjacency space of the current state. Hierarchical reinforcement learning Guided by Landmarks (HIGL) [12] improved data efficiency by adding novelty-based landmarks to adjacency-constrained HRL. However, these improvements only work on environments of limited length and complexity.
The previous work closest to our method is HIGL. However, there are three key differences between that work and our approach. First, our model explicitly includes the whole graph structure, while HIGL needs to train the high-level action to imitate the graph by using an additional loss term corresponding to the Euclidean distance from the nodes of the graph. Second, we use a graph to decouple the horizons of the high level and low level, unlike the previous method, which uses the graph only for guidance. Most importantly, ours can achieve goals in long and complex environments. To the best of our knowledge, there is no prior HRL research that trains a model with decoupled horizons across the two levels.
Figure 4: An overview of DHRL, which includes a mid-level non-parametric policy between the high and low levels. The high-level policy (orange box) delivers a subgoal $sg \in \mathcal{G}$ to the graph level (green box), and the graph instructs the low-level policy (blue box) to reach the waypoint $wp \in \mathcal{G}$. $c_h$ represents how long each high-level action operates for. The low level is given $c_{l,i}$ steps to achieve the goal, where $c_h \neq c_{l,i}$. (Panels: [Graph Construction] and [Planning Over the Graph].)
4 Methods
We introduce Decoupling Horizons Using a Graph in Hierarchical Reinforcement Learning (DHRL), which can separate the time horizons of high-level and low-level policies and bridge the gap between both horizons using a graph. Our framework consists of a high-level policy $\pi^{hi}(sg \mid s, g)$, a low-level policy $\pi^{lo}(a \mid s, wp)$, and a graph $\mathcal{G}$. Given a goal $g_t$ in the environment, the high-level policy outputs a subgoal $sg_t$ (see Figure 4). Then, the shortest path from the current state $s_t$ to $sg_t$ is found on the graph. To do so, $s_t$ and $sg_t$ are added to the existing graph structure, and a sequence of waypoints $(s_t, wp_{t,1}, wp_{t,2}, \ldots, sg_t)$ is returned using a graph search algorithm. Finally, the low-level policy tries to achieve $wp_{t,i}$ during $c_{l,i}$ steps.
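The control flow just described can be summarized by the rollout sketch below. `env`, `pi_hi`, `pi_lo`, and `graph` are hypothetical interfaces (in particular, `graph.shortest_path` is assumed to return (waypoint, step budget) pairs), and the exact re-planning schedule of the high level is simplified; this is an illustration of the structure, not the authors' code.

```python
def dhrl_rollout(env, pi_hi, pi_lo, graph, psi, max_steps: int):
    """One illustrative DHRL episode: the high level proposes a subgoal, the graph
    decomposes it into waypoints, and the low level tracks one waypoint at a time."""
    s, g = env.reset()
    t = 0
    while t < max_steps:
        sg = pi_hi(s, g)                          # subgoal in goal space G
        path = graph.shortest_path(psi(s), sg)    # [(wp_1, c_{l,1}), ..., (sg, c_{l,k})]
        for wp, c_l in path:
            for _ in range(c_l):                  # low level gets c_{l,i} steps per waypoint
                a = pi_lo(s, wp)
                s, done = env.step(a)
                t += 1
                if done or t >= max_steps:
                    return
```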
The key point of our method is that the low-level horizon $h_{low} = c_l$ is unrelated to the high-level horizon $h_{high}$. Since the high level generates one subgoal every $c_h$ steps, an $H$-step task is an $H/c_h$-step task for the high-level agent ($h_{high} = H/c_h$), where $c_h$ is the interval between high-level actions ($c_h > c_l$). In other words, unlike the previous HRL methods, which have the relationship

$$h_{high} \times h_{low} = H, \tag{2}$$

our algorithm has no such relation, removing an obstacle toward a scalable RL algorithm.
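For a concrete instance (the numbers are ours, purely for illustration), consider an episode length $H = 1000$ and a high-level interval $c_h = 50$:

$$\underbrace{h_{high} \times h_{low} = 1000 \;\Rightarrow\; h_{high} = 20,\ h_{low} = 50}_{\text{previous HRL (coupled)}} \qquad \text{vs.} \qquad \underbrace{h_{high} = \tfrac{1000}{50} = 20,\quad h_{low} = c_{l,i}}_{\text{DHRL (decoupled)}}$$

In DHRL the high level still faces only a 20-step problem, while each low-level horizon $c_{l,i}$ is set by the corresponding graph edge cost rather than by $H$.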
Since the low-level horizon $c_l$ is determined by the edge cost between waypoints, we can also write the $c_l$ between the $(i-1)$-th waypoint and the $i$-th waypoint as $c_{l,i}$, but we omit the index $i$ in later statements that do not need to specify the waypoint.
In Section 4.1, we explain how to construct a graph over states and find a path at the graph level. In Section 4.2, we present a low-level policy which can recover the temporal distance between states without overestimation. In Section 4.3, we introduce a strategy to train our method through graph-agnostic off-policy learning. Finally, in Section 4.4, we propose additional techniques for better data efficiency in large environments.
4.1 Graph level: planning over the graph
This section details the graph search part of DHRL planning. We emphasize that every distance in the DHRL model is a temporal distance $Dist(\cdot \rightarrow \cdot)$ obtained through Eq. (1). Thus, our algorithm requires no further information about the environment (e.g., Euclidean distance between states) than general HRL settings.
To find the shortest path, we adopt Dijkstra's algorithm, as in the previous study [4]. The differences from the previous methods [4, 27] are the existence of the high-level policy and whether a secondary path is considered. Let $\psi: \mathcal{S} \rightarrow \mathcal{G}$ be the projection of states onto the goal space. In the graph initialization phase, we sample $n$ nodes (also called landmarks) using the FPS algorithm [2] (Algorithm 2 in Appendix A) from $\psi(s)$, where $s \in \mathcal{S}$ is a state sampled from $\mathcal{B}^{lo}$. Then, we connect a directed edge $s_1 \rightarrow s_2$ if the temporal distance from $s_1$ to $s_2$ is less than the cutoff threshold. In the planning phase, the graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ receives the subgoal $sg_t$ from the high-level policy and adds the projection of the current state $\psi(s_t)$ and $sg_t$ to $\mathcal{V}$, so that the number of nodes in $\mathcal{G}$ becomes $n + 2$. Then, we connect the edges with costs less than the cutoff threshold between the newly added nodes and the existing nodes.
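As a rough sketch of the graph level described here (not the authors' implementation), the snippet below samples $n$ landmarks with a simplified farthest point sampling, connects directed edges whose temporal distance is below a cutoff, and runs Dijkstra's algorithm between two node indices. It assumes a `temporal_dist(g1, g2)` callable (e.g., built from the Q-based distance of Eq. (1)); the Euclidean metric inside the FPS stub, the node count, and the cutoff value are illustrative assumptions.

```python
import heapq
import numpy as np

def greedy_fps(points: np.ndarray, n: int) -> np.ndarray:
    """Simplified farthest point sampling over candidate landmarks (Euclidean, for the sketch)."""
    chosen = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n - 1):
        idx = int(np.argmax(dists))
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

def build_graph(landmarks, temporal_dist, cutoff: float):
    """Directed edges i -> j with cost Dist(i -> j) whenever it is below the cutoff threshold."""
    n = len(landmarks)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j:
                d = temporal_dist(landmarks[i], landmarks[j])
                if d < cutoff:
                    edges[i].append((j, d))
    return edges

def dijkstra(edges, start: int, goal: int):
    """Shortest path on the landmark graph; returns the node index sequence start -> ... -> goal."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal not in dist:
        return None  # no path whose edges are all below the cutoff
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path))
```

In the planning phase sketched in the text, $\psi(s_t)$ and $sg_t$ would simply be appended to the landmark set, edges to and from them added with the same cutoff test, and `dijkstra` called from the node for $\psi(s_t)$ to the node for $sg_t$.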