
However, previous works [30, 31] do not provide a formal justification for this property. In fact, it turns out that the Euclidean distance under LapRep does not correctly capture the inter-state reachability in general. Figure 1(a) shows an example: under LapRep, a state with a larger Euclidean distance to the goal (e.g., A) might actually be closer to the goal than another state with a smaller distance (e.g., B). Consequently, when the agent moves towards the goal, the pseudo-reward provided by LapRep would give a wrong learning signal. Such a mismatch hinders learning with reward shaping and results in inferior performance.
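To make the issue concrete, below is a minimal, hedged sketch of the kind of distance-based pseudo-reward discussed above. The function `phi`, the coefficient `coeff`, and adding the environment reward on top are illustrative assumptions, not necessarily the exact shaping form used in the cited prior works.

```python
import numpy as np

# Hedged sketch of distance-based reward shaping: the pseudo-reward is the
# negative Euclidean distance between the representations of the current state
# and the goal. `phi`, `coeff`, and the additive combination with the
# environment reward are illustrative assumptions.
def shaped_reward(phi, state, goal, env_reward, coeff=1.0):
    dist = np.linalg.norm(phi(state) - phi(goal))
    return env_reward - coeff * dist
```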
Figure 1: Euclidean distances between each state and the goal state G, under LapRep (a) and RA-LapRep (b).
In this work, we introduce a Reachability-Aware Laplacian Representation (RA-LapRep) that reliably captures the inter-state distances in the environment geometry (see Figure 1(b)). Specifically, RA-LapRep is obtained by scaling each dimension of the LapRep by the inverse square root of the corresponding eigenvalue. Despite its simplicity, RA-LapRep has theoretically justified advantages over LapRep in two aspects. First, the Euclidean distance under RA-LapRep can be interpreted as the average commute time, which measures the expected number of steps a random walk needs to navigate between two states; such a distance therefore provides a good measure of reachability. In contrast, to the best of our knowledge, no such connection exists between the Euclidean distance under LapRep and reachability. Second, RA-LapRep is equivalent to the embedding computed by classic multidimensional scaling (MDS) [6], which preserves pairwise distances globally [29]. LapRep, on the other hand, preserves only local information (i.e., it maps neighboring states close together), since it is essentially a Laplacian eigenmap [3]. Thus, LapRep is inherently ill-suited for measuring inter-state reachability.
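As a concrete illustration, the following is a minimal NumPy sketch (not the paper's implementation) of how RA-LapRep can be formed in a tabular setting, assuming the unnormalized graph Laplacian of a known, symmetric state-adjacency matrix. When all non-trivial eigenvectors are retained, the squared RA-LapRep distance scaled by the graph volume equals the average commute time; with fewer dimensions it is an approximation.

```python
import numpy as np

def laprep_and_ra_laprep(adjacency, d):
    """Tabular sketch: LapRep = first d non-trivial Laplacian eigenvectors;
    RA-LapRep = LapRep with each dimension scaled by 1 / sqrt(eigenvalue)."""
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)      # ascending eigenvalues
    lam, u = eigvals[1:d + 1], eigvecs[:, 1:d + 1]    # skip the trivial 0-eigenvalue
    return u, u / np.sqrt(lam)                        # rows are per-state embeddings

# Toy example: a chain of 5 states.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

_, psi = laprep_and_ra_laprep(A, d=4)                 # keep all non-trivial dims
L_pinv = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)   # Laplacian pseudoinverse
vol = A.sum()                                         # graph volume (sum of degrees)
i, j = 0, 4
commute_time = vol * (L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j])
print(commute_time, vol * np.sum((psi[i] - psi[j]) ** 2))   # equal up to numerics
```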
To further validate the advantages of RA-LapRep over LapRep, we conduct experiments comparing them on two discrete gridworld environments and two continuous control environments. The results show that RA-LapRep indeed captures the inter-state reachability much better than LapRep. Furthermore, when used for reward shaping in goal-reaching tasks, RA-LapRep significantly outperforms LapRep. In addition, we show that RA-LapRep can be used to discover bottleneck states based on a graph centrality measure, and it finds the key states more accurately than LapRep.
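This excerpt does not specify which centrality measure is used. Purely as an illustration, the sketch below scores states by a closeness-style centrality over RA-LapRep squared distances (i.e., small average commute time to all other states); this is an assumed choice, not necessarily the paper's method.

```python
import numpy as np

def bottleneck_scores(ra_laprep):
    """Illustrative (assumed) centrality: states with small average squared
    RA-LapRep distance (average commute time) to all others score high.
    `ra_laprep` is an (n_states, d) array such as the one built above."""
    diffs = ra_laprep[:, None, :] - ra_laprep[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)    # pairwise squared distances
    return 1.0 / sq_dists.mean(axis=1)      # closeness-style score
```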
The rest of the paper is organized as follows. In Section 2, we provide background on RL and on LapRep in RL. In Section 3, we introduce the new RA-LapRep, explain why it is more desirable than LapRep, and discuss how to approximate it with neural networks in environments with large or continuous state spaces. In Section 4, we present experiments demonstrating the advantages of RA-LapRep over LapRep. In Section 6, we review related works, and Section 7 concludes the paper.
2 Background
Notations. We use boldface letters (e.g., $\mathbf{u}$) for vectors and calligraphic letters (e.g., $\mathcal{U}$) for sets. For a vector $\mathbf{u}$, $\|\mathbf{u}\|$ denotes its $L_2$ norm, and $\mathrm{diag}(\mathbf{u})$ denotes the diagonal matrix whose main diagonal is $\mathbf{u}$. We use $\mathbf{1}$ to denote an all-ones vector, whose dimension can be inferred from the context.
2.1 Reinforcement Learning
In the Reinforcement Learning (RL) framework [28], an agent aims to learn a strategy that advises how to take actions in each state, with the goal of maximizing the expected cumulative reward. We consider the standard Markov Decision Process (MDP) setting [25], and describe an MDP with a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu)$. $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. The initial state $s_0$ is generated according to the distribution $\mu \in \Delta_{\mathcal{S}}$, where $\Delta_{\mathcal{S}}$ denotes the space of probability distributions over $\mathcal{S}$. At timestep $t$, the agent observes from the environment a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. Then the environment provides the agent with a reward signal $r(s_t, a_t) \in \mathbb{R}$. The state observation at the next timestep, $s_{t+1} \in \mathcal{S}$, is sampled from the distribution $P(s_t, a_t) \in \Delta_{\mathcal{S}}$. We refer to $P \in (\Delta_{\mathcal{S}})^{\mathcal{S} \times \mathcal{A}}$ as the transition function and $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ as the reward function. A stationary stochastic policy $\pi \in (\Delta_{\mathcal{A}})^{\mathcal{S}}$ specifies a decision-making strategy, where $\pi(s, a)$ is the probability of taking action $a$ in state $s$. The agent's goal is to learn an optimal policy $\pi^*$ that maximizes the