Reachability-Aware Laplacian Representation
in Reinforcement Learning
Kaixin Wang^1, Kuangqi Zhou^2, Jiashi Feng^3, Bryan Hooi^{1,4}, Xinchao Wang^{1,2}
{kaixin.wang, kzhou}@u.nus.edu
jshfeng@gmail.com
{dcsbhk, xinchao}@nus.edu.sg
1Institute of Data Science, National University of Singapore
2Electrical and Computer Engineering, National University of Singapore
3ByteDance
4School of Computing, National University of Singapore
Abstract
In Reinforcement Learning (RL), the Laplacian Representation (LapRep) is a task-agnostic state representation that encodes the geometry of the environment. A desirable property of LapRep stated in prior works is that the Euclidean distance in the LapRep space roughly reflects the reachability between states, which motivates using this distance for reward shaping. However, we find that LapRep does not necessarily have this property in general: two states that are close under LapRep can actually be far apart in the environment. Such a mismatch would impede the learning process in reward shaping. To fix this issue, we introduce a Reachability-Aware Laplacian Representation (RA-LapRep), obtained by properly scaling each dimension of LapRep. Despite its simplicity, we demonstrate, through both theoretical explanations and experimental results, that RA-LapRep captures the inter-state reachability better than LapRep. We further show that this improvement yields a significant boost in reward shaping performance and also benefits bottleneck state discovery.
1 Introduction
Reinforcement Learning (RL) seeks to learn a decision strategy that advises the agent on how to take actions according to the perceived states [28]. The state representation plays an important role in the agent's learning process: a proper choice of state representation can help improve generalization [1, 27, 33], encourage exploration [16, 18, 23], and enhance learning efficiency [9, 30, 31]. In studying state representations, one direction of particular interest is to learn a task-agnostic representation that encodes the transition dynamics of the environment [19, 21].
Along this line, the Laplacian Representation (LapRep) has received increasing attention [10, 16, 20, 30, 31]. Specifically, LapRep is formed by the d eigenvectors of the Laplacian matrix of the graph induced by the transition dynamics that correspond to the d smallest eigenvalues (see Section 2.2 for the definition). Prior works [30, 31] assume that LapRep has a desirable property: the Euclidean distance in the LapRep space roughly reflects the reachability among states, i.e., a smaller distance implies that it is easier to reach one state from another. This motivates using the Euclidean distance under LapRep for reward shaping [30, 31].
However, previous works [30, 31] provide no formal justification for this property. In fact, it turns out that the Euclidean distance under LapRep does not correctly capture the inter-state reachability in general. Figure 1 (a) shows an example: under LapRep, a state with a larger distance to the goal (e.g., A) might actually be closer to the goal in the environment than another state (e.g., B). Consequently, when the agent moves towards the goal, the pseudo-reward provided by LapRep would give a wrong learning signal. Such a mismatch would hinder the learning process with reward shaping and result in inferior performance.
Figure 1: Euclidean distances between each state and the goal state G, under LapRep (a) and RA-LapRep (b).
In this work, we introduce a Reachability-Aware Laplacian Representation (RA-LapRep) that reliably captures inter-state distances in the environment geometry (see Figure 1 (b)). Specifically, RA-LapRep is obtained by scaling each dimension of the LapRep by the inverse square root of the corresponding eigenvalue. Despite its simplicity, RA-LapRep has theoretically justified advantages over LapRep in two aspects. First, the Euclidean distance under RA-LapRep is directly related to the average commute time, which measures the expected number of steps required by a random walk to navigate between two states; such a distance therefore provides a good measure of reachability. In contrast, to the best of our knowledge, no such connection exists between the Euclidean distance under LapRep and reachability. Second, RA-LapRep is equivalent to the embedding computed by classic multidimensional scaling (MDS) [6], which preserves pairwise distances globally [29]. LapRep, on the other hand, preserves only local information (i.e., it maps neighboring states close), since it is essentially a Laplacian eigenmap [3]. Thus, LapRep is inherently ill-suited for measuring inter-state reachability.
To further validate the advantages of RA-LapRep over LapRep, we compare them experimentally on two discrete gridworld environments and two continuous control environments. The results show that RA-LapRep indeed captures inter-state reachability much better than LapRep. Furthermore, when used for reward shaping in goal-reaching tasks, RA-LapRep significantly outperforms LapRep. In addition, we show that RA-LapRep can be used to discover bottleneck states based on a graph centrality measure, and that it finds these key states more accurately than LapRep.
The rest of the paper is organized as follows. Section 2 gives background on RL and LapRep. Section 3 introduces the new RA-LapRep, explains why it is more desirable than LapRep, and discusses how to approximate it with neural networks in environments with a large or continuous state space. Section 4 presents experiments demonstrating the advantages of RA-LapRep over LapRep. Section 6 reviews related work, and Section 7 concludes the paper.
2 Background
Notations. We use boldface letters (e.g., $\mathbf{u}$) for vectors and calligraphic letters (e.g., $\mathcal{U}$) for sets. For a vector $\mathbf{u}$, $\|\mathbf{u}\|$ denotes its $L_2$ norm, and $\mathrm{diag}(\mathbf{u})$ denotes a diagonal matrix whose main diagonal is $\mathbf{u}$. We use $\mathbf{1}$ to denote an all-ones vector, whose dimension can be inferred from the context.
2.1 Reinforcement Learning
In the Reinforcement Learning (RL) framework [28], an agent aims to learn a strategy that advises how to take actions in each state, with the goal of maximizing the expected cumulative reward. We consider the standard Markov Decision Process (MDP) setting [25], and describe an MDP with a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu)$. $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. The initial state $s_0$ is generated according to the distribution $\mu \in \Delta(\mathcal{S})$, where $\Delta(\mathcal{S})$ denotes the space of probability distributions over $\mathcal{S}$. At timestep $t$, the agent observes from the environment a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. The environment then provides the agent with a reward signal $r(s_t, a_t) \in \mathbb{R}$. The state observation at the next timestep, $s_{t+1} \in \mathcal{S}$, is sampled from the distribution $P(s_t, a_t) \in \Delta(\mathcal{S})$. We refer to $P \in (\Delta(\mathcal{S}))^{\mathcal{S} \times \mathcal{A}}$ as the transition function and $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ as the reward function. A stationary stochastic policy $\pi \in (\Delta(\mathcal{A}))^{\mathcal{S}}$ specifies a decision-making strategy, where $\pi(s, a)$ is the probability of taking action $a$ in state $s$. The agent's goal is to learn an optimal policy $\pi^*$ that maximizes the expected cumulative reward:
$$\pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\pi, P}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \tag{1}$$
where $\Pi$ denotes the policy space and $\gamma \in [0, 1)$ is the discount factor.

Figure 2: Visualizations of the Euclidean distances under LapRep between all states and the state G.
2.2 Laplacian representation in RL
The Laplacian Representation (LapRep) [31] is a task-agnostic state representation in RL, originally proposed in [20] (where it is known as the proto-value function). Formally, the Laplacian representations of all states are the eigenfunctions of the Laplace-Beltrami diffusion operator on the state-space manifold. For simplicity, we restrict the introduction of LapRep here to the discrete-state case and refer readers to [31] for the formulation in the continuous case.
The states and transitions in an MDP can be viewed as nodes and edges of a graph. The LapRep is formed by the $d$ eigenvectors of the graph Laplacian with the smallest eigenvalues (usually $d \ll |\mathcal{S}|$). Each eigenvector (of length $|\mathcal{S}|$) corresponds to one dimension of the LapRep. Formally, we denote the graph as $\mathcal{G} = (\mathcal{S}, \mathcal{E})$, where $\mathcal{S}$ is the node set consisting of all states and $\mathcal{E}$ is the edge set consisting of transitions between states. The Laplacian matrix of the graph $\mathcal{G}$ is defined as $L := D - A$, where $A$ is the adjacency matrix of $\mathcal{G}$ and $D := \mathrm{diag}(A\mathbf{1})$ is the degree matrix [7]. We sort the eigenvalues of $L$ in ascending order and denote the $i$-th smallest one as $\lambda_i$. The unit eigenvector corresponding to $\lambda_i$ is denoted as $\mathbf{v}_i \in \mathbb{R}^{|\mathcal{S}|}$. Then, the $d$-dimensional LapRep of a state $s$ can be defined as
$$\rho_d(s) := \left(\mathbf{v}_1[s], \mathbf{v}_2[s], \cdots, \mathbf{v}_d[s]\right), \tag{2}$$
where $\mathbf{v}_i[s]$ denotes the entry in the vector $\mathbf{v}_i$ corresponding to state $s$. In particular, $\mathbf{v}_1$ is a normalized all-ones vector and hence provides no information about the environment geometry; we therefore omit $\mathbf{v}_1$ and only consider the other dimensions.
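As a concrete illustration of this definition, the following sketch (a hypothetical NumPy example on a toy open gridworld, not the authors' code) builds $L = D - A$ from the adjacency matrix of the transition graph and reads off the LapRep of Eqn. (2) by explicit eigendecomposition.

```python
import numpy as np

def laplacian_representation(A, d):
    """LapRep of Eqn. (2): entries of the d eigenvectors of L = D - A
    with the smallest eigenvalues (v_1 is the trivial constant vector)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    lam, V = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    return lam[:d], V[:, :d]        # rho_d(s) is the s-th row of V[:, :d]

# Toy 3x3 open gridworld: states are cells, edges connect 4-neighbouring cells.
n = 3
A = np.zeros((n * n, n * n))
for r in range(n):
    for c in range(n):
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n:
                A[r * n + c, rr * n + cc] = 1.0

lam, rho = laplacian_representation(A, d=4)
print(lam)      # lam[0] is ~0 and corresponds to the constant eigenvector v_1
print(rho[0])   # LapRep of the top-left cell
```

In the environments used later, the explicit eigendecomposition is only tractable for small discrete state spaces; this limitation is exactly what the sample-based approximation below and in Section 3.3 addresses.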
For environments with a large or even continuous state space, it is infeasible to obtain the LapRep by directly computing the eigendecomposition. To approximate LapRep with neural networks, previous works [30, 31] propose sample-based methods based on spectral graph drawing [15]. In particular, Wang et al. [30] introduce a generalized graph drawing objective that ensures a dimension-wise faithful approximation of the ground truth $\rho_d(s)$.
3 Method
3.1 Reachability-aware Laplacian representation
In prior works [30, 31], LapRep is believed to have a desirable property: the Euclidean distance between two states under LapRep, i.e., $\mathrm{dist}_\rho(s, s') := \|\rho_d(s) - \rho_d(s')\|$, roughly reflects the reachability between $s$ and $s'$. That is, a smaller distance between two states implies that it is easier for the agent to reach one state from the other. Figure 2 (a) shows an illustrative example similar to the one in [31]. In this example, $\mathrm{dist}_\rho(A, G)$ is smaller than $\mathrm{dist}_\rho(B, G)$, which aligns with the intuition that moving to $G$ from $A$ takes fewer steps than from $B$. Motivated by this, LapRep has been used for reward shaping in goal-reaching tasks [30, 31].
However, little justification is provided in previous works [30, 31] for this argument (i.e., that the Euclidean distance under LapRep captures the inter-state reachability). In fact, we find that it does not hold in general. As shown in Figure 2 (b-e), $\mathrm{dist}_\rho(A, G)$ is larger than $\mathrm{dist}_\rho(B, G)$, even though $A$ is clearly closer to $G$ than $B$. As a result, when the agent moves towards the goal, $\mathrm{dist}_\rho$ might give a wrong reward signal. Such a mismatch hinders the policy learning process when this distance is used for reward shaping.
In this paper, we introduce the following Reachability-Aware Laplacian Representation (RA-LapRep):
$$\phi_d(s) := \left(\frac{\mathbf{v}_2[s]}{\sqrt{\lambda_2}}, \frac{\mathbf{v}_3[s]}{\sqrt{\lambda_3}}, \cdots, \frac{\mathbf{v}_d[s]}{\sqrt{\lambda_d}}\right), \tag{3}$$
which fixes the issue of LapRep and better captures the reachability between states. We provide both theoretical explanations (Section 3.2) and empirical results (Section 4) to demonstrate the advantage of RA-LapRep over LapRep.
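To make the scaling concrete, here is a minimal sketch of Eqn. (3) (hypothetical NumPy code on a 4-state path graph): compared with the LapRep sketch above, the only change is dropping the constant eigenvector and dividing each remaining dimension by the square root of its eigenvalue.

```python
import numpy as np

def ra_laprep(L, d):
    """RA-LapRep of Eqn. (3): eigenvectors of L scaled by 1/sqrt(eigenvalue),
    skipping the trivial constant eigenvector v_1 (lambda_1 = 0)."""
    lam, V = np.linalg.eigh(L)
    return V[:, 1:d] / np.sqrt(lam[1:d])   # broadcasts the scaling over states

# Path graph 0 - 1 - 2 - 3: state 3 is harder to reach from state 0 than
# state 1 is, and the RA-LapRep distance should reflect this.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
phi = ra_laprep(L, d=4)
print(np.linalg.norm(phi[0] - phi[1]))   # small: adjacent states
print(np.linalg.norm(phi[0] - phi[3]))   # larger: opposite ends of the path
```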
3.2 Why is RA-LapRep more desirable than LapRep?
In this subsection, we provide theoretical grounding for RA-LapRep from two aspects, which explains why it captures the inter-state reachability better than LapRep.

First, the Euclidean distance under RA-LapRep is related to a quantity that measures the expected number of random-walk steps between states. Specifically, let $\mathrm{dist}_\phi(s, s') := \|\phi_d(s) - \phi_d(s')\|$ denote the Euclidean distance between states $s$ and $s'$ under RA-LapRep. When $d = |\mathcal{S}|$, $\mathrm{dist}_\phi(s, s')$ has a nice interpretation [12]: it is proportional to the square root of the average commute time between states $s$ and $s'$, i.e.,
$$\mathrm{dist}_\phi(s, s') \propto \sqrt{n(s, s')}. \tag{4}$$
Here the average commute time $n(s, s')$ measures the expected number of steps required by a random walk to navigate from $s$ to $s'$ and back (see the Appendix for the formal definition). Thus, $\mathrm{dist}_\phi(s, s')$ provides a good quantification of the concept of reachability. Additionally, with the proportionality in Eqn. (4), RA-LapRep can be used to discover bottleneck states (see Section 4.3 for a detailed discussion and experiments). In contrast, to the best of our knowledge, the Euclidean distance under LapRep (i.e., $\mathrm{dist}_\rho(s, s')$) has no similar interpretation that matches the concept of reachability.
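The proportionality in Eqn. (4) can be checked numerically on a small graph. The sketch below (hypothetical NumPy code, not part of the paper's implementation) uses the standard commute-time identity for random walks on graphs, $n(i, j) = \mathrm{vol}(\mathcal{G})\,(L^+_{ii} + L^+_{jj} - 2L^+_{ij})$, where $L^+$ is the pseudoinverse of the Laplacian and $\mathrm{vol}(\mathcal{G})$ is the sum of node degrees (cf. [12]).

```python
import numpy as np

# A small connected graph: edges 0-1, 1-2, 2-3, 2-4, 3-4.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
vol = A.sum()                                  # sum of degrees

lam, V = np.linalg.eigh(L)
phi = V[:, 1:] / np.sqrt(lam[1:])              # full-dimensional RA-LapRep (d = |S|)
Lp = np.linalg.pinv(L)                         # Laplacian pseudoinverse

i, j = 0, 4
commute = vol * (Lp[i, i] + Lp[j, j] - 2 * Lp[i, j])   # average commute time n(i, j)
print(np.linalg.norm(phi[i] - phi[j]))                  # dist_phi(i, j)
print(np.sqrt(commute / vol))                           # identical: the constant is 1/sqrt(vol)
```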
Second, RA-LapRep preserves global information, whereas LapRep only focuses on preserving local information. Specifically, RA-LapRep is equivalent (up to a constant factor) to the embedding obtained by classic multidimensional scaling (MDS) [6] with the squared distance matrix in MDS being $D^{(2)}_{ij} = n(i, j)$ [12] (see the Appendix for a detailed derivation). Since classic MDS is known to preserve pairwise distances globally [29], the Euclidean distance under RA-LapRep is a good fit for measuring the inter-state reachability. In comparison, LapRep is only able to preserve local information. This is because, when viewing the MDP transition dynamics as a graph, LapRep is essentially a Laplacian eigenmap [3]. As discussed in [3], the Laplacian eigenmap only aims to preserve the local graph structure of each neighborhood (i.e., mapping neighboring states close), while making no attempt to preserve global information about the whole graph (e.g., pairwise geodesic distances between nodes [29]). Therefore, the Euclidean distance under LapRep is inherently not intended for measuring the reachability between states, particularly for distant states.
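The MDS equivalence can also be verified directly: double-centre the squared commute-time matrix, eigendecompose it, and compare the resulting embedding with RA-LapRep. The sketch below (a hypothetical NumPy check; the derivation itself is in the Appendix) confirms that the two embeddings induce the same pairwise distances up to the constant factor $\sqrt{\mathrm{vol}(\mathcal{G})}$.

```python
import numpy as np

# Classical MDS with squared distances D2[i, j] = n(i, j) (average commute time).
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
n_s = A.shape[0]
L = np.diag(A.sum(1)) - A
vol = A.sum()
Lp = np.linalg.pinv(L)
lam, V = np.linalg.eigh(L)
phi = V[:, 1:] / np.sqrt(lam[1:])                      # RA-LapRep with d = |S|

diag = np.diag(Lp)
D2 = vol * (diag[:, None] + diag[None, :] - 2 * Lp)    # squared commute-time matrix

J = np.eye(n_s) - np.ones((n_s, n_s)) / n_s            # double-centring matrix
B = -0.5 * J @ D2 @ J                                  # Gram matrix of the MDS embedding
mu, U = np.linalg.eigh(B)
mds = U[:, 1:] * np.sqrt(np.maximum(mu[1:], 0.0))      # drop the ~0 eigenvalue

d_mds = np.linalg.norm(mds[:, None] - mds[None, :], axis=-1)
d_phi = np.linalg.norm(phi[:, None] - phi[None, :], axis=-1)
print(np.allclose(d_mds, np.sqrt(vol) * d_phi))        # True: same geometry up to scale
```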
3.3 Approximating RA-LapRep
Note that the theoretical reasoning in the above subsection assumes $d = |\mathcal{S}|$. In practice, however, for environments with a large or even continuous state space, it is infeasible to use $d = |\mathcal{S}|$, and hence we need to take a small $d$. One may argue that using a small $d$ leads to approximation error: when $d < |\mathcal{S}|$, the distance $\mathrm{dist}_\phi(s, s')$ is not exactly proportional to $\sqrt{n(s, s')}$. Fortunately, the gap between the approximated $\tilde{n}(s, s')$ and the true $n(s, s')$ turns out to be upper bounded by $C \sum_{i=d+1}^{|\mathcal{S}|} \frac{1}{\lambda_i}$, where $C$ is a constant and the summation is over the $|\mathcal{S}| - d$ largest eigenvalues. Thus, this bound will not be very large. We empirically show in Section 4.2 that a small $d$ is sufficient for good reward shaping performance and that further increasing $d$ does not yield any noticeable improvement.
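As a rough illustration of why truncation is benign (a hypothetical NumPy experiment on a toy graph, not one of the paper's environments), one can compare the distance to a fixed goal state computed from only the first few dimensions of RA-LapRep against the full $|\mathcal{S}|$-dimensional one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s = 100
# Toy connected graph: a ring of 100 states plus some random chords.
A = np.zeros((n_s, n_s))
for i in range(n_s):
    A[i, (i + 1) % n_s] = A[(i + 1) % n_s, i] = 1.0
for _ in range(50):
    i, j = rng.integers(0, n_s, size=2)
    if i != j:
        A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(1)) - A
lam, V = np.linalg.eigh(L)
phi_full = V[:, 1:] / np.sqrt(lam[1:])                 # full RA-LapRep (|S| - 1 columns)

goal = 0
d_full = np.linalg.norm(phi_full - phi_full[goal], axis=1)
for d in (5, 10, 20, n_s):
    # Dimensions 2..d of Eqn. (3) are the first d - 1 columns of phi_full.
    d_trunc = np.linalg.norm(phi_full[:, :d - 1] - phi_full[goal, :d - 1], axis=1)
    print(d, np.corrcoef(d_trunc, d_full)[0, 1])       # correlation is exactly 1 at d = |S|
```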
Furthermore, even with a small $d$, it is still impractical to obtain RA-LapRep by directly computing the eigendecomposition. To tackle this, we follow [30] and approximate RA-LapRep with neural networks using sample-based methods. Specifically, we first learn a parameterized approximation $f_i(\cdot; \theta)$ of each eigenvector $\mathbf{v}_i$ by optimizing a generalized graph drawing objective [30], i.e., $f_i(s; \theta) \approx \mathbf{v}_i[s]$, where $\theta$ denotes the learnable parameters of the neural networks. Next, we approximate each eigenvalue $\lambda_i$ simply by
$$\lambda_i = \mathbf{v}_i^\top L \mathbf{v}_i \approx \mathbb{E}_{(s, s') \in \mathcal{T}}\left[\left(f_i(s; \hat{\theta}) - f_i(s'; \hat{\theta})\right)^2\right], \tag{5}$$
where $\hat{\theta}$ denotes the learned parameters and $\mathcal{T}$ is the same transition data used to train $f$. Let $\tilde{\lambda}_i$ denote the approximated eigenvalue. RA-LapRep can then be approximated by
$$\phi_d(s) \approx \tilde{\phi}_d(s) = \left(\frac{f_2(s; \hat{\theta})}{\sqrt{\tilde{\lambda}_2}}, \frac{f_3(s; \hat{\theta})}{\sqrt{\tilde{\lambda}_3}}, \cdots, \frac{f_d(s; \hat{\theta})}{\sqrt{\tilde{\lambda}_d}}\right). \tag{6}$$
In experiments, we find that this approximation works quite well and is on par with using the true $\phi_d(s)$.
4 Experiments
Figure 3: Environments used in our experiments (agents shown in red, and walls in grey).

In this section, we conduct experiments to validate the benefits of RA-LapRep compared to LapRep. Following [30, 31], we consider both discrete gridworld and continuous control environments. Figure 3 shows the layouts of the environments used; we briefly introduce them here and refer readers to the Appendix for more details. In the discrete gridworld environments, the agent takes one of four actions (up, down, left, right) to move from one cell to another; if it hits a wall, the agent remains in the current cell. In the continuous control environments, the agent picks a continuous action from $[-\pi, \pi)$ that specifies the direction along which it moves a fixed small step forward. For all environments, the observation is the agent's $(x, y)$-position.
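For concreteness, a stripped-down version of the discrete environment described above could look like the following (a hypothetical sketch; the actual environments, layouts, and wrappers are described in the Appendix).

```python
import numpy as np

class GridWorld:
    """Minimal discrete gridworld: four actions, walls block movement,
    and the observation is the agent's (x, y) cell position."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right

    def __init__(self, layout):
        # layout: list of strings, '#' = wall, '.' = free cell, 'S' = start cell
        self.grid = [list(row) for row in layout]
        self.h, self.w = len(self.grid), len(self.grid[0])
        self.start = next((r, c) for r in range(self.h) for c in range(self.w)
                          if self.grid[r][c] == 'S')
        self.pos = self.start

    def reset(self):
        self.pos = self.start
        return np.array(self.pos, dtype=np.float32)

    def step(self, action):
        r, c = self.pos
        dr, dc = self.ACTIONS[action]
        nr, nc = r + dr, c + dc
        # Hitting a wall or the boundary leaves the agent in the current cell.
        if 0 <= nr < self.h and 0 <= nc < self.w and self.grid[nr][nc] != '#':
            self.pos = (nr, nc)
        return np.array(self.pos, dtype=np.float32)

env = GridWorld(["S..",
                 ".#.",
                 "..."])
obs = env.reset()
transitions = []
for a in np.random.default_rng(0).integers(0, 4, size=20):
    prev, obs = obs, env.step(int(a))
    transitions.append((prev, obs))     # random-walk rollouts provide the transition data T
```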
4.1 Capturing reachability between states
In this subsection, we evaluate how well the learned RA-LapRep and LapRep capture the reachability among states. We also include the ground-truth RA-LapRep for comparison. Specifically, for each state $s$, we compute the Euclidean distance between $s$ and an arbitrarily chosen goal state $s_{\text{goal}}$ under the learned LapRep $\tilde{\rho}$, the learned RA-LapRep $\tilde{\phi}$, and the ground-truth RA-LapRep $\phi$ (i.e., $\mathrm{dist}_{\tilde{\phi}}(s, s_{\text{goal}})$, $\mathrm{dist}_{\tilde{\rho}}(s, s_{\text{goal}})$, and $\mathrm{dist}_{\phi}(s, s_{\text{goal}})$). We then visualize the three distances as heatmaps. For continuous environments, the heatmaps are obtained by first sampling a set of states that roughly covers the state space, and then interpolating among the sampled states.
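In code, producing such a heatmap for a discrete gridworld is straightforward once per-state embeddings are available (a hypothetical helper; names like `free_mask` are illustrative, and the continuous-environment interpolation is omitted).

```python
import numpy as np

def distance_heatmap(emb, goal_idx, grid_shape, free_mask):
    """emb: (num_free_states, dim) embeddings, one row per non-wall cell in
    row-major order; free_mask: boolean (H, W) array marking non-wall cells.
    Returns an (H, W) array of distances to the goal state (NaN on walls)."""
    dists = np.linalg.norm(emb - emb[goal_idx], axis=1)
    heat = np.full(grid_shape, np.nan)
    heat[free_mask] = dists
    return heat

# e.g. compare plt.imshow(distance_heatmap(phi_learned, g, (H, W), free_mask))
# with the same map computed from rho_learned and from the ground-truth phi.
```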
We train neural networks to learn RA-LapRep and LapRep. Specifically, we implement the two-step approximation procedure introduced in Section 3.3 to learn RA-LapRep, and adopt the method in [30] to learn LapRep. Details of the training and network architectures are in the Appendix. The ground-truth RA-LapRep $\phi$ is calculated using Eqn. (3). For discrete environments, the eigenvectors and eigenvalues are computed by eigendecomposition; for continuous environments, the eigenfunctions and eigenvalues are approximated by the finite difference method with a 5-point stencil [24].
The visualization results are shown in Figure 4. For a clearer comparison, we highlight an example trajectory in each environment and plot the distance values along each trajectory in the line charts on the right. As we can see, for both discrete and continuous environments, as the agent is moving