next state $s'$ is better than the mean value, for weighting the updates of the actor and the model. To this end, we propose the State Advantage Weighting (SAW) algorithm. We conduct extensive experiments on the D4RL benchmarks. The experimental results indicate that our method is competitive with, or even better than, prior methods. Furthermore, we demonstrate that our method performs well during online learning after the policy is initialized offline.
2 Preliminaries
We consider an environment that is formulated as a Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, R, p, \gamma \rangle$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ represents the action space, $R$ is the reward function, $p$ is the transition dynamics, and $\gamma$ is the discount factor. In QSA learning, the policy $\pi: \mathcal{S} \mapsto \mathcal{A}$ determines the behavior of the agent. The goal of the reinforcement learning (RL) agent is to maximize the expected discounted return $\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$.
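As a small illustration, the discounted return of a finite reward sequence (a truncation of the infinite sum above) can be computed as follows; the function name is ours.

```python
def discounted_return(rewards, gamma=0.99):
    # Computes sum_{t=0}^{T-1} gamma^t * r_{t+1} for a finite trajectory,
    # i.e., a truncated version of the infinite-horizon objective above.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: three rewards of 1.0 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```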
The action $Q$-function describes the expected discounted return obtained by taking action $a$ in state $s$: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right]$. The action advantage is defined as $A(s, a) = Q(s, a) - V(s)$, where $V(s)$ is the value function. The Q-learning update is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a) \right]. \tag{1}$$
The action is then selected by $\arg\max_{a \in \mathcal{A}} Q(s, a)$.
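A minimal tabular sketch of the update in Eq. (1) and the greedy action choice, assuming discrete state and action indices (the array layout and names are ours):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Eq. (1): Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def greedy_action(Q, s):
    # a = argmax_a Q(s, a)
    return int(np.argmax(Q[s]))

# Toy usage on a table with 5 states and 3 actions (hypothetical sizes).
Q = np.zeros((5, 3))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(greedy_action(Q, s=0))  # -> 1 after the single positive update
```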
In QSS learning, we instead focus on the state $Q$-function $Q(s, s')$; that is, the value in QSS is independent of actions. The action is determined by an inverse dynamics model $a = I(s, s')$, i.e., which action the agent should take so that it reaches $s'$ from $s$, giving $\pi: \mathcal{S} \times \mathcal{S} \mapsto \mathcal{A}$. We can similarly define that the optimal value satisfies $Q^*(s, s') = r(s, s') + \gamma \max_{s'' \in \mathcal{S}} Q^*(s', s'')$. The Bellman update for QSS gives [12]:
$$Q(s, s') \leftarrow Q(s, s') + \alpha \left[ r + \gamma \max_{s'' \in \mathcal{S}} Q(s', s'') - Q(s, s') \right]. \tag{2}$$
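For comparison, a tabular sketch of the QSS update in Eq. (2): here the table is indexed by (state, next state), and the max runs over successor states $s''$ of $s'$. The explicit reachability map is our toy assumption, used only to make the max enumerable in this discrete sketch.

```python
import numpy as np

def qss_update(Q, s, s_next, r, reachable, alpha=0.1, gamma=0.99):
    # Eq. (2): Q(s, s') <- Q(s, s') + alpha * [r + gamma * max_{s''} Q(s', s'') - Q(s, s')]
    # `reachable[s_next]` lists the states s'' that can follow s' (toy assumption;
    # continuous-state methods cannot enumerate s'' like this).
    successors = reachable[s_next]
    td_target = r + gamma * max(Q[s_next, s2] for s2 in successors)
    Q[s, s_next] += alpha * (td_target - Q[s, s_next])
    return Q

# Toy usage: 4 states, where state 2 can transition to states 1 or 3.
Q = np.zeros((4, 4))
reachable = {2: [1, 3]}
Q = qss_update(Q, s=0, s_next=2, r=1.0, reachable=reachable)
```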
We further define the state advantage $A(s, s') = Q(s, s') - V(s)$, which measures how much better the next state $s'$ is than the mean value.
3 SAW: State Advantage Weighting for Offline RL
In this section, we first show experimentally that directly applying a typical QSS learning algorithm, D3G [12], fails in offline RL. We then present our novel offline RL method, SAW, which leverages the state advantage to weight the updates of the actor and the prediction model (sketched below).
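The full method is detailed in the remainder of this section. Purely as an illustration of the weighting idea, the sketch below weights per-sample actor and prediction-model losses by an exponentiated state advantage; the exponential (AWR-style) weight, the clipping, and all network and function names are our assumptions, not necessarily SAW's exact rule.

```python
import torch

def advantage_weighted_losses(q_net, v_net, actor, model, s, a, s_next, beta=1.0):
    # State advantage A(s, s') = Q(s, s') - V(s), as defined in the preliminaries.
    # Assumes q_net(s, s') and v_net(s) return tensors of shape [batch].
    with torch.no_grad():
        adv = q_net(s, s_next) - v_net(s)
        w = torch.exp(beta * adv).clamp(max=100.0)  # assumed exponential weight, clipped for stability

    # Weighted supervised losses: imitate (s, a, s') tuples more strongly
    # when the reached state s' has a high state advantage.
    actor_loss = (w * ((actor(s, s_next) - a) ** 2).mean(dim=-1)).mean()   # inverse dynamics / actor
    model_loss = (w * ((model(s) - s_next) ** 2).mean(dim=-1)).mean()      # prediction model
    return actor_loss, model_loss
```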
3.1 D3G Fails in Offline RL
As a typical QSS learning algorithm, D3G [12] aims at learning a policy under the assumption of deterministic transition dynamics. In addition to the state $Q$-function, it learns three models: a prediction model $M(s)$ that predicts the next state given the current state; an inverse dynamics model $I(s, s')$ that decides how to act to reach $s'$ starting from $s$; and a forward model $F(s, a)$ that takes a state and an action as input and outputs the next state, ensuring that the state proposed by the prediction model can be reached in a single step. The prediction model, inverse dynamics model (actor), and forward model are all trained in a supervised learning manner (see the sketch below). Unfortunately, D3G exhibits very poor performance on continuous control tasks in its vanilla formulation (e.g., Walker2d-v2, Humanoid-v2). We then wonder: will D3G succeed in offline settings?
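A minimal sketch of these three components, assuming simple PyTorch MLPs regressed on logged transitions $(s, a, s')$ per the supervised description above; the architecture, sizes, and names are our assumptions, and D3G's actual training objectives may differ in detail.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class D3GComponents(nn.Module):
    # Prediction model M(s) -> s', inverse dynamics I(s, s') -> a, forward model F(s, a) -> s'.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.prediction = mlp(state_dim, state_dim)                   # M(s)
        self.inverse_dynamics = mlp(2 * state_dim, action_dim)        # I(s, s')
        self.forward_model = mlp(state_dim + action_dim, state_dim)   # F(s, a)

    def supervised_losses(self, s, a, s_next):
        # All three models are fit by regression on dataset transitions (s, a, s').
        pred_loss = ((self.prediction(s) - s_next) ** 2).mean()
        inv_loss = ((self.inverse_dynamics(torch.cat([s, s_next], dim=-1)) - a) ** 2).mean()
        fwd_loss = ((self.forward_model(torch.cat([s, a], dim=-1)) - s_next) ** 2).mean()
        return pred_loss, inv_loss, fwd_loss
```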
We examine this by conducting experiments on hopper-medium-v2 from the D4RL [14] MuJoCo datasets. We observe in Figure 1(a) that D3G fails to learn a meaningful policy on this dataset. As shown in Figure 1(b), the $Q$ value (i.e., $Q(s, s')$) is severely overestimated (up to the scale of $10^{12}$). We then wonder, and this is the key question of this paper: can we make QSS learning work in offline RL? This is important due to its potential for promoting learning from observation and goal-conditioned RL in the offline setting.
To this end, we propose our novel QSS learning algorithm, State Advantage Weighting (SAW). We observe that SAW exhibits very good performance on hopper-medium-v2, with its value estimated fairly accurately, as shown in Figure 1(c).