
[Figure 2 panels: success rate vs. high-level interval (x-axis) and map size 12×12, 24×24, 56×56 (y-axis), for the previous HRL (HRAC) and ours (DHRL).]
Figure 2: Our method is scalable in large environments by breaking down the relations between the two levels and allowing both levels to operate at their suitable horizons.
We show that this practical limitation of HRL can be mitigated by breaking down the coupled horizons of HRL. In previous HRL frameworks, the horizons of the low level and the high level are structurally tied to each other because they share the total length of the episode. This relation causes a tradeoff between the training burdens of the two levels: if the interval between high-level actions increases (x-axis in Figure 2), the low-level policy has to cover a wider range, and in the opposite case, the high-level policy alone bears the extended burden in large environments (y-axis in Figure 2). This is why previous HRL algorithms cannot cope with the extended components of large environments (see Table 1 for the performance of a previous HRL method at various intervals).
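To make this coupling explicit, write $H$ for the episode length and $c_h$ for the interval between high-level actions; the decomposition below is our own illustration of the argument, not a result from the experiments:
$$
\underbrace{\lceil H / c_h \rceil}_{\text{high-level decisions}} \;\times\; \underbrace{c_h}_{\text{low-level steps per subgoal}} \;\approx\; H .
$$
Because the product is fixed by the episode length, enlarging $c_h$ to relieve the high level directly widens the region each low-level rollout must cover, and shrinking $c_h$ does the opposite.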
We break down the coupled horizons of HRL: to break the relation between the horizons of the two levels, we adopt a graph structure. In our method, the high-level policy can use a longer temporal abstraction while the low-level policy takes charge of only a small coverage, because the subgoal is decomposed into several waypoints with a graph. In this way, the HRL algorithm gains the ability to stretch the interval of high-level actions freely and, with this strengthened temporal abstraction, achieves complex and large tasks.
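To illustrate the idea of bridging the two horizons with a graph, the sketch below decomposes a distant subgoal into a chain of nearby waypoints by running a shortest-path query over landmark states. It is only a structural sketch: the landmark set, the `edge_cost` estimate, and the `max_edge_cost` threshold are assumptions of the example, not the exact construction used in DHRL.

```python
# Illustrative sketch: decompose a distant high-level subgoal into nearby
# waypoints via a graph, so the low-level policy only needs short-range skills.
# The graph construction and edge costs here are placeholder assumptions.
import networkx as nx
import numpy as np

def build_goal_graph(landmarks, edge_cost, max_edge_cost):
    """Connect landmark states whose estimated traversal cost is small."""
    g = nx.Graph()
    g.add_nodes_from(range(len(landmarks)))
    for i in range(len(landmarks)):
        for j in range(i + 1, len(landmarks)):
            c = edge_cost(landmarks[i], landmarks[j])
            if c <= max_edge_cost:          # keep only edges the low level can realize
                g.add_edge(i, j, weight=c)
    return g

def subgoal_to_waypoints(graph, landmarks, start_idx, subgoal_idx):
    """Return the sequence of waypoints bridging the current state and the subgoal."""
    path = nx.shortest_path(graph, start_idx, subgoal_idx, weight="weight")
    return [landmarks[i] for i in path]

# Toy usage with Euclidean distance as a stand-in cost estimate.
landmarks = [np.array([0., 0.]), np.array([1., 0.]), np.array([2., 0.]), np.array([3., 0.])]
graph = build_goal_graph(landmarks, lambda a, b: float(np.linalg.norm(a - b)), max_edge_cost=1.5)
print(subgoal_to_waypoints(graph, landmarks, start_idx=0, subgoal_idx=3))
```

In practice, the edge costs should reflect how easily the low-level policy can travel between two landmarks (e.g., an estimated temporal distance) rather than raw Euclidean distance; the point of the sketch is only that the low level never has to cover more than one edge at a time.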
In summary, our main contributions are:
• We show that the previous HRL structures are not scalable in large environments, and that this limitation can be mitigated by removing the coupled traits of the high level and the low level.
• To break down the coupled traits of HRL, we propose DHRL, which decouples the horizons of the high-level and low-level policies and bridges the gap using a graph.
• Our algorithm outperforms state-of-the-art algorithms in typical HRL environments and achieves complex and long tasks.
2 Preliminaries
We consider a finite-horizon Universal Markov Decision Process (UMDP), represented as a tuple $(\mathcal{S}, \mathcal{G}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, where $\mathcal{S}$, $\mathcal{G}$, and $\mathcal{A}$ are the state space, goal space, and action space, respectively. The environment is defined by the transition distribution $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ and the reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \rightarrow \mathbb{R}$, where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ are the state and action at timestep $t$, respectively. The total return of a trajectory $\tau = (s_0, a_0, \ldots, s_H, a_H)$ is $R(\tau, g) = \sum_{t=0}^{H-1} \gamma^t r(s_{t+1}, g)$, where $r(s_{t+1}, g)$ (or $r(s_t, a_t, g)$) is a goal-conditioned reward function and $\gamma$ is a discount factor. The subgoal $sg$, the waypoint $wp$, and the goal $g$ are defined in the goal space $\mathcal{G}$, and we consider $\mathcal{G}$ to be a subspace of $\mathcal{S}$ with a mapping $\psi: \mathcal{S} \rightarrow \mathcal{G}$.
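As a concrete (and simplified) instance of these definitions, the sketch below implements a sparse goal-conditioned reward and the discounted return $R(\tau, g)$; the projection $\psi$, the success threshold, and the $0/{-1}$ reward values are assumptions for illustration, not the actual reward of any particular environment.

```python
# Minimal sketch of a goal-conditioned sparse reward r(s_{t+1}, g) and the
# discounted return R(tau, g) = sum_t gamma^t r(s_{t+1}, g).
# The mapping psi and the success threshold are assumptions for illustration.
import numpy as np

def psi(state):
    """Map a state to goal space G (here: the first two coordinates)."""
    return state[:2]

def reward(next_state, goal, threshold=0.5):
    """Sparse goal-conditioned reward: 0 on success, -1 otherwise."""
    return 0.0 if np.linalg.norm(psi(next_state) - goal) < threshold else -1.0

def discounted_return(trajectory, goal, gamma=0.99):
    """R(tau, g) for tau given as a list of (s_t, a_t) pairs."""
    states = [s for (s, a) in trajectory]
    return sum(gamma ** t * reward(states[t + 1], goal) for t in range(len(states) - 1))

# Toy usage: a short trajectory in a 3-D state space with a 2-D goal space.
traj = [(np.array([0.0, 0.0, 0.0]), 0), (np.array([0.5, 0.0, 0.0]), 1), (np.array([1.0, 0.0, 0.0]), 0)]
print(discounted_return(traj, goal=np.array([2.0, 0.0])))   # -1.99: neither step reaches the goal
```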
An HRL framework typically has a high-level policy $\pi^{hi}$ and a low-level policy $\pi^{lo}$, each maintaining a separate replay buffer, $B^{hi}$ and $B^{lo}$. Our method also follows the general HRL framework and employs both buffers, which store high-level transition data $(s_t, g_t, sg_t, r_t, s_{t+c_h}) \in B^{hi}$ and low-level transition data $(s_t, wp_t, a_t, r_t, s_{t+1}) \in B^{lo}$, where $c_h$ is the interval between high-level actions.
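A minimal sketch of the two buffers and the transition tuples they store is given below; the field names mirror the notation above, while the fixed capacity and uniform sampling are simplifying assumptions.

```python
# Sketch of the separate high-level and low-level replay buffers.
# Tuple layouts mirror (s_t, g_t, sg_t, r_t, s_{t+c_h}) and (s_t, wp_t, a_t, r_t, s_{t+1});
# the fixed capacity and uniform sampling are simplifying assumptions.
import random
from collections import deque, namedtuple

HighTransition = namedtuple("HighTransition", "state goal subgoal reward next_state")
LowTransition = namedtuple("LowTransition", "state waypoint action reward next_state")

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

buffer_hi = ReplayBuffer()   # stores HighTransition tuples every c_h steps
buffer_lo = ReplayBuffer()   # stores LowTransition tuples every environment step
```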
One of the key problems in training HRL is that the low-level policy $\pi^{lo}$ is non-stationary, so old data from past policies may contain different next states $s_{t+c_h}$ even though the high-level policy provides the same subgoal in the identical state. To bypass this off-policy discrepancy, some HRL models use off-policy correction, which relabels the old subgoal of the high-level policy as the most ‘plausible subgoal’ that would make the current low-level agent produce a transition similar to the old data [17, 28, 12] (HIRO-style off-policy correction). Other approaches propose to relabel the high-level action $sg_t$ as the achieved state $s_{t+c_h}$ (HAC-style hindsight action relabelling), reducing the computation cost of finding a ‘plausible subgoal’ [15].
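As a sketch of the HAC-style relabelling described above, the stored high-level action is simply replaced by the state actually reached after $c_h$ steps, projected into goal space; the tuple layout and projection mirror the earlier sketches and are likewise assumptions, not the full relabelling pipeline of either method.

```python
# Sketch of HAC-style hindsight action relabelling: the stored high-level action
# sg_t is replaced by the state actually achieved after c_h steps (projected into
# goal space), so old transitions stay consistent with the current low level.
# The tuple layout and psi mirror the earlier sketches and are assumptions.
from collections import namedtuple
import numpy as np

HighTransition = namedtuple("HighTransition", "state goal subgoal reward next_state")

def psi(state):
    """Projection into goal space (here: the first two coordinates)."""
    return np.asarray(state)[:2]

def hindsight_relabel_action(transition):
    """Replace sg_t with the achieved state psi(s_{t+c_h})."""
    return transition._replace(subgoal=psi(transition.next_state))
```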