
However, previous works [30, 31] do not provide a formal justification for this property. In fact, it turns out that the Euclidean distance under LapRep does not correctly capture the inter-state reachability in general. Figure 1(a) shows an example: under LapRep, a state with a larger Euclidean distance to the goal (e.g., A) might actually be closer to the goal than another state with a smaller distance (e.g., B). Consequently, when the agent moves towards the goal, the pseudo-reward provided by LapRep would give a wrong learning signal. Such a mismatch hinders learning with reward shaping and results in inferior performance.
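To make the issue concrete, below is a minimal, hedged sketch of the kind of distance-based pseudo-reward discussed above. The function `phi`, the coefficient `coeff`, and adding the environment reward on top are illustrative assumptions, not necessarily the exact shaping form used in the cited prior works.

```python
import numpy as np

# Hedged sketch of distance-based reward shaping: the pseudo-reward is the
# negative Euclidean distance between the representations of the current state
# and the goal. `phi`, `coeff`, and the additive combination with the
# environment reward are illustrative assumptions.
def shaped_reward(phi, state, goal, env_reward, coeff=1.0):
    dist = np.linalg.norm(phi(state) - phi(goal))
    return env_reward - coeff * dist
```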
Figure 1: Euclidean distances between each state and the goal state G, under LapRep (a) and RA-LapRep (b).
In this work, we introduce a Reachability-Aware Laplacian Representation (RA-LapRep) that reliably captures the inter-state distances in the environment geometry (see Figure 1(b)). Specifically, RA-LapRep is obtained by scaling each dimension of the LapRep by the inverse square root of the corresponding eigenvalue. Despite its simplicity, RA-LapRep has theoretically justified advantages over LapRep in two aspects. First, the Euclidean distance under RA-LapRep can be interpreted as the average commute time, which measures the expected number of steps a random walk needs to navigate between two states; such a distance therefore provides a good measure of reachability. In contrast, to the best of our knowledge, no such connection exists between the Euclidean distance under LapRep and reachability. Second, RA-LapRep is equivalent to the embedding computed by classic multidimensional scaling (MDS) [6], which preserves pairwise distances globally [29]. LapRep, on the other hand, preserves only local information (i.e., it maps neighboring states close together), since it is essentially a Laplacian eigenmap [3]. Thus, LapRep is inherently ill-suited for measuring inter-state reachability.
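As a concrete illustration, the following is a minimal NumPy sketch (not the paper's implementation) of how RA-LapRep can be formed in a tabular setting, assuming the unnormalized graph Laplacian of a known, symmetric state-adjacency matrix. When all non-trivial eigenvectors are retained, the squared RA-LapRep distance scaled by the graph volume equals the average commute time; with fewer dimensions it is an approximation.

```python
import numpy as np

def laprep_and_ra_laprep(adjacency, d):
    """Tabular sketch: LapRep = first d non-trivial Laplacian eigenvectors;
    RA-LapRep = LapRep with each dimension scaled by 1 / sqrt(eigenvalue)."""
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)      # ascending eigenvalues
    lam, u = eigvals[1:d + 1], eigvecs[:, 1:d + 1]    # skip the trivial 0-eigenvalue
    return u, u / np.sqrt(lam)                        # rows are per-state embeddings

# Toy example: a chain of 5 states.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

_, psi = laprep_and_ra_laprep(A, d=4)                 # keep all non-trivial dims
L_pinv = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)   # Laplacian pseudoinverse
vol = A.sum()                                         # graph volume (sum of degrees)
i, j = 0, 4
commute_time = vol * (L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j])
print(commute_time, vol * np.sum((psi[i] - psi[j]) ** 2))   # equal up to numerics
```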
To further validate the advantages of RA-LapRep over LapRep, we conduct experiments comparing them on two discrete gridworld environments and two continuous control environments. The results show that RA-LapRep indeed captures the inter-state reachability much better than LapRep. Furthermore, when used for reward shaping in goal-reaching tasks, RA-LapRep significantly outperforms LapRep. In addition, we show that RA-LapRep can be used to discover bottleneck states based on a graph centrality measure, and it finds the key states more accurately than LapRep.
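This excerpt does not specify which centrality measure is used. Purely as an illustration, the sketch below scores states by a closeness-style centrality over RA-LapRep squared distances (i.e., small average commute time to all other states); this is an assumed choice, not necessarily the paper's method.

```python
import numpy as np

def bottleneck_scores(ra_laprep):
    """Illustrative (assumed) centrality: states with small average squared
    RA-LapRep distance (average commute time) to all others score high.
    `ra_laprep` is an (n_states, d) array such as the one built above."""
    diffs = ra_laprep[:, None, :] - ra_laprep[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)    # pairwise squared distances
    return 1.0 / sq_dists.mean(axis=1)      # closeness-style score
```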
The rest of the paper is organized as follows. In Section 2, we provide background on RL and on LapRep in RL. In Section 3, we introduce the new RA-LapRep, explain why it is more desirable than LapRep, and discuss how to approximate it with neural networks in environments with large or continuous state spaces. In Section 4, we present experiments demonstrating the advantages of RA-LapRep over LapRep. In Section 6, we review related works, and Section 7 concludes the paper.
2 Background
Notations. We use boldface letters (e.g., $\mathbf{u}$) for vectors and calligraphic letters (e.g., $\mathcal{U}$) for sets. For a vector $\mathbf{u}$, $\|\mathbf{u}\|$ denotes its $L_2$ norm, and $\mathrm{diag}(\mathbf{u})$ denotes the diagonal matrix whose main diagonal is $\mathbf{u}$. We use $\mathbf{1}$ to denote an all-ones vector, whose dimension can be inferred from the context.
2.1 Reinforcement Learning
In the Reinforcement Learning (RL) framework [28], an agent aims to learn a strategy that advises how to take actions in each state, with the goal of maximizing the expected cumulative reward. We consider the standard Markov Decision Process (MDP) setting [25], and describe an MDP with a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu)$. $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. The initial state $s_0$ is generated according to the distribution $\mu \in \Delta_{\mathcal{S}}$, where $\Delta_{\mathcal{S}}$ denotes the space of probability distributions over $\mathcal{S}$. At timestep $t$, the agent observes from the environment a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. Then the environment provides the agent with a reward signal $r(s_t, a_t) \in \mathbb{R}$. The state observation at the next timestep, $s_{t+1} \in \mathcal{S}$, is sampled from the distribution $P(s_t, a_t) \in \Delta_{\mathcal{S}}$. We refer to $P \in (\Delta_{\mathcal{S}})^{\mathcal{S} \times \mathcal{A}}$ as the transition function and $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ as the reward function. A stationary stochastic policy $\pi \in (\Delta_{\mathcal{A}})^{\mathcal{S}}$ specifies a decision-making strategy, where $\pi(s, a)$ is the probability of taking action $a$ in state $s$. The agent's goal is to learn an optimal policy $\pi^*$ that maximizes the