State Advantage Weighting for Offline RL
Jiafei Lyu1, Aicheng Gong1,4, Le Wan3, Zongqing Lu2, Xiu Li1
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2School of Computer Science, Peking University
3IEG, Tencent
4China Nuclear Power Engineering Company Ltd
{lvjf20, gac19}@mails.tsinghua.edu.cn, vinowan@tencent.com,
zongqing.lu@pku.edu.cn, li.xiu@sz.tsinghua.edu.cn
Work done while working as an intern at Tencent IEG.
3rd Offline Reinforcement Learning Workshop at Neural Information Processing Systems, 2022.
Abstract
We present state advantage weighting for offline reinforcement learning (RL). In contrast to the action advantage $A(s, a)$ that is commonly adopted in QSA learning, we leverage the state advantage $A(s, s')$ and QSS learning for offline RL, hence decoupling actions from values. We expect the agent to reach high-reward states, with the action determined by how the agent can get to the corresponding state. Experiments on D4RL datasets show that our proposed method achieves remarkable performance against common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online.
1 Introduction
Offline reinforcement learning (offline RL) generally defines the task of learning a policy from a static dataset, which is typically collected by some unknown process. This setting has aroused wide attention from the community due to its potential for scaling RL algorithms to real-world problems. One of the major challenges in offline RL is extrapolation error [16, 26], where out-of-distribution (OOD) actions are overestimated. Such error is accumulated through bootstrapping, which in turn negatively affects policy improvement. Prior methods address this problem by making the learned policy stay close to the data-collecting policy (behavior policy) [15, 26, 54], learning without querying OOD samples [24, 57], explicitly assigning low values to OOD actions [27, 32], leveraging uncertainty measurement [22, 59, 55, 2], etc.
In this paper, we instead explore a novel QSS-style learning paradigm for offline RL. Specifically, we estimate the state $Q$-function $Q(s, s')$, which represents the value of transitioning from state $s$ to the next state $s'$ and acting optimally thenceforth: $Q(s, s') = r(s, s') + \gamma \max_{s'' \in \mathcal{S}} Q(s', s'')$. By doing so, we decouple actions from value learning, and the action is determined by how the agent can reach the next state $s'$. The source of OOD issues then turns from the next action $a'$ into the next next state $s''$. In order to get $s''$, we additionally train a predictive model that proposes feasible and high-value states. We deem that this formulation is closer to human decision-making, e.g., when climbing, we predict where we can go and then decide how to get there.
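
To make the decoupling concrete, the following is a minimal sketch of how action selection could look at inference time under QSS learning, assuming a trained prediction model and inverse dynamics model; the function and variable names are illustrative, not from the paper's released code:

```python
import torch

@torch.no_grad()
def select_action(state, prediction_model, inverse_dynamics):
    """QSS-style action selection (illustrative sketch).

    Instead of querying a policy pi(a|s) directly, we first propose a
    desirable next state s' with the prediction model, then recover the
    action that moves the agent from s to s' with the inverse dynamics model.
    """
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    s_next = prediction_model(s)           # proposed next state, M(s)
    action = inverse_dynamics(s, s_next)   # a = I(s, s')
    return action.squeeze(0).cpu().numpy()
```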
Unfortunately, we find that directly applying D3G [12], a typical QSS-learning algorithm, is infeasible in offline settings. We wonder: can QSS learning work for offline RL? Motivated by IQL [24], we propose to learn the value function by expectile regression [23] such that both the state $Q$-function $Q(s, s')$ and the value function $V(s)$ can be well-trained. We train extra dynamics models for predicting the next next state $s''$, and an inverse dynamics model $I(s, s')$ to determine the action, i.e., how to reach $s'$ from $s$. We leverage the state advantage $A(s, s') = Q(s, s') - V(s)$, which describes how much better the next state $s'$ is than the mean value, for weighting the updates of the actor and the model. To this end, we propose the State Advantage Weighting (SAW) algorithm. We conduct numerous experiments on the D4RL benchmarks. The experimental results indicate that our method is competitive with or even better than prior methods. Furthermore, we demonstrate that our method shows good performance during online learning after the policy is initialized offline.
2 Preliminaries
We consider an environment that is formulated by a Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, p, \gamma \rangle$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ represents the action space, $\mathcal{R}$ is the reward function, $p$ is the transition dynamics, and $\gamma$ the discount factor. In QSA learning, the policy $\pi: \mathcal{S} \mapsto \mathcal{A}$ determines the behavior of the agent. The goal of the reinforcement learning (RL) agent is to maximize the expected discounted return: $\mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1}]$. The action $Q$-function describes the expected discounted return of taking action $a$ in state $s$: $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a]$. The action advantage is defined as $A(s, a) = Q(s, a) - V(s)$, where $V(s)$ is the value function. Q-learning gives:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a) \right]. \tag{1}$$

The action is then decided by $\arg\max_{a \in \mathcal{A}} Q(s, a)$. In QSS learning, we focus on the state $Q$-function: $Q(s, s')$. That is, the value in QSS is independent of actions. The action is determined by an inverse dynamics model $a = I(s, s')$, i.e., what actions the agent takes such that it can reach $s'$ from $s$, $\pi: \mathcal{S} \times \mathcal{S} \mapsto \mathcal{A}$. We can similarly define that the optimal value satisfies $Q(s, s') = r(s, s') + \gamma \max_{s'' \in \mathcal{S}} Q(s', s'')$. The Bellman update for QSS gives [12]:

$$Q(s, s') \leftarrow Q(s, s') + \alpha \left[ r + \gamma \max_{s'' \in \mathcal{S}} Q(s', s'') - Q(s, s') \right]. \tag{2}$$

We further define the state advantage $A(s, s') = Q(s, s') - V(s)$, which measures how good the next state $s'$ is over the mean value.
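
As a toy illustration of the update in Equation (2), the snippet below performs the QSS update on a small deterministic MDP; the reward table, reachability structure, and hyperparameters are made up for the example and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states = 4
Q = np.zeros((num_states, num_states))            # Q(s, s')
reward = rng.random((num_states, num_states))     # toy reward r(s, s')
# states reachable in one step from each state (toy deterministic dynamics)
reachable = {s: [s, (s + 1) % num_states] for s in range(num_states)}
alpha, gamma = 0.1, 0.99

def qss_update(s, s_next):
    # target: r(s, s') + gamma * max over next states reachable from s'
    target = reward[s, s_next] + gamma * max(Q[s_next, s2] for s2 in reachable[s_next])
    Q[s, s_next] += alpha * (target - Q[s, s_next])

qss_update(0, 1)
```

In continuous state spaces the maximum over $s''$ cannot be enumerated like this, which is why the method below replaces the explicit maximum with a learned value function trained by expectile regression.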
3 SAW: State Advantage Weighting for Offline RL
In this section, we first experimentally show that directly applying a typical QSS learning algorithm, D3G [12], results in a failure in offline RL. We then present our novel offline RL method, SAW, which leverages the state advantage for weighting the update of the actor and the prediction model.
3.1 D3G Fails in Offline RL
As a typical QSS learning algorithm, D3G [12] aims at learning a policy under the assumption of deterministic transition dynamics. In addition to the state $Q$-function, it learns three models: a prediction model $M(s)$ that predicts the next state with the current state as input; an inverse dynamics model $I(s, s')$ that decides how to act to reach $s'$ starting from $s$; and a forward model $F(s, a)$ that receives a state and an action as input and outputs the next state, making sure that the state proposed by the prediction model can be reached in a single step. The prediction model, inverse dynamics model (actor), and forward model are all trained in a supervised learning manner. Unfortunately, D3G exhibits very poor performance on continuous control tasks in its vanilla formulation (e.g., Walker2d-v2, Humanoid-v2). We then wonder: will D3G succeed in offline settings?
We examine this by conducting experiments on hopper-medium-v2 from the D4RL [14] MuJoCo datasets. We observe in Figure 1(a) that D3G fails to learn a meaningful policy on this dataset. As shown in Figure 1(b), the $Q$ value (i.e., $Q(s, s')$) is extremely overestimated (up to the scale of $10^{12}$). We then ask the question that forms the key intuition of this paper: can we make QSS learning work in offline RL? This is important due to its potential for promoting learning from observation and goal-conditioned RL in the offline manner.
To this end, we propose our novel QSS learning algorithm, State Advantage Weighting (SAW). We observe that our method, SAW, exhibits very good performance on hopper-medium-v2, with its value estimated fairly well, as can be seen in Figure 1(c).
Figure 1: Normalized score comparison of D3G against our method on hopper-medium-v2 from D4RL (a). The $Q$ value estimate of D3G incurs severe overestimation (b), while our SAW does not (c). The results are obtained over 5 random runs, and the shaded region captures the standard deviation.
3.2 State Advantage Weighting
Under the novel framework of QSS learning, we also aim at learning $Q(s, s')$. To boost the stability of the value estimate and avoid overestimation, we leverage the state advantage $A(s, s')$ instead of the action advantage $A(s, a)$. Our method is motivated by IQL [24], which learns entirely within the support of the dataset. IQL trains the value function $V(s)$ using a neural network, and leverages expectile regression for updating the critic and (action) advantage weighted regression for updating the actor. Similarly, we adopt expectile regression for the critic and (state) advantage weighted regression for updating the prediction model and the actor.
To be specific, we need to train four extra parts other than the critic: a value function $V(s)$, a forward dynamics model $F(s, a)$, a prediction model $M(s)$, and an inverse dynamics model $I(s, s')$ (the actor). The critic we want to learn is updated via expectile regression, which is closely related to quantile regression [34]. The expectile regression gives:

$$\arg\min_{m_\tau} \mathbb{E}_{x \sim X} \left[ L_2^\tau (x - m_\tau) \right], \tag{3}$$

where $L_2^\tau(u) = |\tau - \mathbb{I}(u < 0)| u^2$, $\mathbb{I}$ is the indicator function, and $X$ is a collection of some random variable. This loss generally emphasizes the contributions of $x$ values larger than $m_\tau$ and downweights those small ones. To ease the stochasticity from the environment (identical to IQL), we introduce the value function and approximate the expectile with respect to the distribution of the next state, i.e.,

$$\mathcal{L}_\psi = \mathbb{E}_{s, s' \sim \mathcal{D}} \left[ L_2^\tau \left( Q_{\theta'}(s, s') - V_\psi(s) \right) \right], \tag{4}$$

where the state $Q$-function is parameterized by $\theta$ with a target network parameterized by $\theta'$, and the value function is parameterized by $\psi$. Then, the state $Q$-function is updated with the MSE loss:

$$\mathcal{L}_\theta = \mathbb{E}_{s, s' \sim \mathcal{D}} \left[ \left( r(s, s') + \gamma V_\psi(s') - Q_\theta(s, s') \right)^2 \right]. \tag{5}$$

Note that in Equations (4) and (5), we only use the state and next state from the fixed dataset to update the state $Q$-function and value function, leaving out any worry of bootstrapping error.
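
A minimal PyTorch-style sketch of the expectile value loss in Equation (4) and the state $Q$-function loss in Equation (5) is given below; the module names (`q_target`, `q_net`, `v_net`) and the hyperparameter values are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    # L^tau_2(u) = |tau - 1(u < 0)| * u^2, Eq. (3)
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def value_loss(v_net, q_target, s, s_next, tau=0.7):
    # Eq. (4): regress V_psi(s) toward an upper expectile of Q_theta'(s, s')
    with torch.no_grad():
        q = q_target(s, s_next)
    return expectile_loss(q - v_net(s), tau)

def q_loss(q_net, v_net, s, s_next, r, gamma=0.99):
    # Eq. (5): MSE toward r(s, s') + gamma * V_psi(s'), using only dataset states
    with torch.no_grad():
        target = r + gamma * v_net(s_next)
    return F.mse_loss(q_net(s, s_next), target)
```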
Training the forward model. The forward model $F_\phi(s, a)$, parameterized by $\phi$, receives the state and action as input and predicts the next state (no reward signal is predicted). A forward model is required because we want to ensure that the state proposed by our method is reachable in one step. To be specific, if we merely train a single model $f(s)$ that predicts the next state based on the current state alone, there is every possibility that the proposed state is unreachable, inaccurate, or even invalid. However, if we train a forward model to predict the possible next state and encode that information in the prediction model, it can enhance the reliability of the predicted state. The forward model is trained by minimizing:

$$\mathcal{L}_\phi = \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \| F_\phi(s, a) - s' \|_2^2 \right]. \tag{6}$$
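
Equation (6) is plain supervised regression; a one-function sketch (assuming `forward_model` is a network mapping a state-action pair to a predicted next state) could look like:

```python
import torch.nn.functional as F

def forward_model_loss(forward_model, s, a, s_next):
    # Eq. (6): ||F_phi(s, a) - s'||_2^2 averaged over the batch
    return F.mse_loss(forward_model(s, a), s_next)
```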
Training the inverse dynamics model. We also need the inverse dynamics model $I_\omega(s, s')$, parameterized by $\omega$, to help us identify how the agent can reach the next state $s'$ starting from the current state $s$. The inverse dynamics model is trained by weighted imitation learning, which is similar in spirit to advantage weighted regression (AWR) [40, 24, 50, 39, 37]:

$$\mathcal{L}_\omega = \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \exp\left( \beta A(s, s') \right) \| I_\omega(s, s') - a \|_2^2 \right], \tag{7}$$
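
The weighted imitation objective in Equation (7) can be sketched as follows; the exponential-weight clipping and the value of `beta` are common AWR-style choices assumed here for numerical stability, not prescriptions from the paper:

```python
import torch

def inverse_dynamics_loss(inv_model, q_target, v_net, s, a, s_next,
                          beta=3.0, max_weight=100.0):
    # Eq. (7): exp(beta * A(s, s'))-weighted behavior cloning of the dataset action
    with torch.no_grad():
        adv = (q_target(s, s_next) - v_net(s)).squeeze(-1)   # A(s, s'), shape (batch,)
        weight = torch.exp(beta * adv).clamp(max=max_weight)  # clipped for stability
    bc_error = (inv_model(s, s_next) - a).pow(2).sum(dim=-1)  # squared L2 per sample
    return (weight * bc_error).mean()
```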