State Advantage Weighting for Offline RL
Jiafei Lyu1, Aicheng Gong1,4, Le Wan3, Zongqing Lu2, Xiu Li1
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2School of Computer Science, Peking University
3IEG, Tencent
4China Nuclear Power Engineering Company Ltd
{lvjf20, gac19}@mails.tsinghua.edu.cn, vinowan@tencent.com,
zongqing.lu@pku.edu.cn, li.xiu@sz.tsinghua.edu.cn
Work done while working as an intern at Tencent IEG.
3rd Offline Reinforcement Learning Workshop at Neural Information Processing Systems, 2022.
Abstract
We present state advantage weighting for offline reinforcement learning (RL). In contrast to the action advantage $A(s, a)$ that is commonly adopted in QSA learning, we leverage the state advantage $A(s, s')$ and QSS learning for offline RL, hence decoupling actions from values. We expect the agent to reach high-reward states, with the action determined by how the agent can get to the corresponding state. Experiments on D4RL datasets show that our proposed method achieves remarkable performance against common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online.
1 Introduction
Offline reinforcement learning (offline RL) generally defines the task of learning a policy from a static dataset, which is typically collected by some unknown process. This setting has aroused wide attention from the community due to its potential for scaling RL algorithms to real-world problems. One of the major challenges in offline RL is extrapolation error [16, 26], where out-of-distribution (OOD) actions are overestimated. Such error is accumulated through bootstrapping, which in turn negatively affects policy improvement. Prior methods address this problem by making the learned policy stay close to the data-collecting policy (behavior policy) [15, 26, 54], learning without querying OOD samples [24, 57], explicitly assigning low values to OOD actions [27, 32], leveraging uncertainty measurement [22, 59, 55, 2], etc.
In this paper, we instead explore a novel QSS-style learning paradigm for offline RL. Specifically, we estimate the state $Q$-function $Q(s, s')$, which represents the value of transitioning from state $s$ to the next state $s'$ and acting optimally thenceforth: $Q(s, s') = r(s, s') + \gamma \max_{s'' \in \mathcal{S}} Q(s', s'')$. By doing so, we decouple actions from value learning, and the action is determined by how the agent can reach the next state $s'$. The source of OOD issues then turns from the next action $a'$ into the next next state $s''$. In order to get $s''$, we additionally train a predictive model that proposes feasible and high-value states. We deem that this formulation is closer to human decision-making, e.g., when climbing, we predict where we can go and then decide how to get there.
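
To make the decoupling concrete, the following is a minimal sketch of how action selection could look at inference time under QSS learning, assuming a trained prediction model and inverse dynamics model; the function and variable names are illustrative, not from the paper's released code:

```python
import torch

@torch.no_grad()
def select_action(state, prediction_model, inverse_dynamics):
    """QSS-style action selection (illustrative sketch).

    Instead of querying a policy pi(a|s) directly, we first propose a
    desirable next state s' with the prediction model, then recover the
    action that moves the agent from s to s' with the inverse dynamics model.
    """
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    s_next = prediction_model(s)           # proposed next state, M(s)
    action = inverse_dynamics(s, s_next)   # a = I(s, s')
    return action.squeeze(0).cpu().numpy()
```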
Unfortunately, we find that directly applying D3G [12], a typical QSS-learning algorithm, is infeasible in offline settings. We wonder: can QSS learning work for offline RL? Motivated by IQL [24], we propose to learn the value function by expectile regression [23] such that both the state $Q$-function $Q(s, s')$ and the value function $V(s)$ can be well-trained. We train extra dynamics models for predicting the next next state $s''$, and an inverse dynamics model $I(s, s')$ to determine the action, i.e., how to reach $s'$ from $s$. We leverage the state advantage $A(s, s') = Q(s, s') - V(s)$, which describes how much better the next state $s'$ is than the mean value, for weighting the updates of the actor and the model. To this end, we propose the State Advantage Weighting (SAW) algorithm. We conduct numerous experiments on the D4RL benchmarks. The experimental results indicate that our method is competitive with or even better than prior methods. Furthermore, we demonstrate that our method shows good performance during online learning after the policy is initialized offline.
2 Preliminaries
We consider an environment that is formulated by a Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, p, \gamma \rangle$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ represents the action space, $\mathcal{R}$ is the reward function, $p$ is the transition dynamics, and $\gamma$ the discount factor. In QSA learning, the policy $\pi: \mathcal{S} \mapsto \mathcal{A}$ determines the behavior of the agent. The goal of the reinforcement learning (RL) agent is to maximize the expected discounted return: $\mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1}]$. The action $Q$-function describes the expected discounted return of taking action $a$ in state $s$: $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a]$. The action advantage is defined as $A(s, a) = Q(s, a) - V(s)$, where $V(s)$ is the value function. Q-learning gives:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a) \right]. \tag{1}$$

The action is then decided by $\arg\max_{a \in \mathcal{A}} Q(s, a)$. In QSS learning, we focus on the state $Q$-function: $Q(s, s')$. That is, the value in QSS is independent of actions. The action is determined by an inverse dynamics model $a = I(s, s')$, i.e., what actions the agent takes such that it can reach $s'$ from $s$, $\pi: \mathcal{S} \times \mathcal{S} \mapsto \mathcal{A}$. We can similarly define that the optimal value satisfies $Q(s, s') = r(s, s') + \gamma \max_{s'' \in \mathcal{S}} Q(s', s'')$. The Bellman update for QSS gives [12]:

$$Q(s, s') \leftarrow Q(s, s') + \alpha \left[ r + \gamma \max_{s'' \in \mathcal{S}} Q(s', s'') - Q(s, s') \right]. \tag{2}$$

We further define the state advantage $A(s, s') = Q(s, s') - V(s)$, which measures how good the next state $s'$ is over the mean value.
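
As a toy illustration of the update in Equation (2), the snippet below performs the QSS update on a small deterministic MDP; the reward table, reachability structure, and hyperparameters are made up for the example and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states = 4
Q = np.zeros((num_states, num_states))            # Q(s, s')
reward = rng.random((num_states, num_states))     # toy reward r(s, s')
# states reachable in one step from each state (toy deterministic dynamics)
reachable = {s: [s, (s + 1) % num_states] for s in range(num_states)}
alpha, gamma = 0.1, 0.99

def qss_update(s, s_next):
    # target: r(s, s') + gamma * max over next states reachable from s'
    target = reward[s, s_next] + gamma * max(Q[s_next, s2] for s2 in reachable[s_next])
    Q[s, s_next] += alpha * (target - Q[s, s_next])

qss_update(0, 1)
```

In continuous state spaces the maximum over $s''$ cannot be enumerated like this, which is why the method below replaces the explicit maximum with a learned value function trained by expectile regression.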
3 SAW: State Advantage Weighting for Offline RL
In this section, we first experimentally show that directly applying a typical QSS learning algorithm, D3G [12], results in a failure in offline RL. We then present our novel offline RL method, SAW, which leverages the state advantage for weighting the update of the actor and the prediction model.
3.1 D3G Fails in Offline RL
As a typical QSS learning algorithm, D3G [12] aims at learning a policy under the assumption of deterministic transition dynamics. In addition to the state $Q$-function, it learns three models: a prediction model $M(s)$ that predicts the next state with the current state as input; an inverse dynamics model $I(s, s')$ that decides how to act to reach $s'$ starting from $s$; and a forward model $F(s, a)$ that receives a state and an action as input and outputs the next state, making sure that the state proposed by the prediction model can be reached in a single step. The prediction model, inverse dynamics model (actor), and forward model are all trained in a supervised learning manner. Unfortunately, D3G exhibits very poor performance on continuous control tasks in its vanilla formulation (e.g., Walker2d-v2, Humanoid-v2). We then wonder: will D3G succeed in offline settings?
We examine this by conducting experiments on hopper-medium-v2 from the D4RL [14] MuJoCo datasets. We observe in Figure 1(a) that D3G fails to learn a meaningful policy on this dataset. As shown in Figure 1(b), the $Q$ value (i.e., $Q(s, s')$) is extremely overestimated (up to the scale of $10^{12}$). We then ask the question that forms the key intuition of this paper: can we make QSS learning work in offline RL? This is important due to its potential for promoting learning from observation and goal-conditioned RL in the offline manner.
To this end, we propose our novel QSS learning algorithm, State Advantage Weighting (SAW). We observe that our method, SAW, exhibits very good performance on hopper-medium-v2, with its value estimated fairly well, as can be seen in Figure 1(c).
Figure 1: Normalized score comparison of D3G against our method on hopper-medium-v2 from D4RL (a). The $Q$ value estimate of D3G incurs severe overestimation (b), while our SAW does not (c). The results are obtained over 5 random runs, and the shaded region captures the standard deviation.
3.2 State Advantage Weighting
Under the novel framework of QSS learning, we also aim at learning $Q(s, s')$. To boost the stability of the value estimate and avoid overestimation, we leverage the state advantage $A(s, s')$ instead of the action advantage $A(s, a)$. Our method is motivated by IQL [24], which learns entirely within the support of the dataset. IQL trains the value function $V(s)$ using a neural network, and leverages expectile regression for updating the critic and (action) advantage weighted regression for updating the actor. Similarly, we adopt expectile regression for the critic and (state) advantage weighted regression for updating the prediction model and the actor.
To be specific, we need to train four extra parts other than the critic: a value function $V(s)$, a forward dynamics model $F(s, a)$, a prediction model $M(s)$, and an inverse dynamics model $I(s, s')$ (the actor). The critic we want to learn is updated via expectile regression, which is closely related to quantile regression [34]. The expectile regression gives:

$$\arg\min_{m_\tau} \mathbb{E}_{x \sim X} \left[ L_2^\tau (x - m_\tau) \right], \tag{3}$$

where $L_2^\tau(u) = |\tau - \mathbb{I}(u < 0)| u^2$, $\mathbb{I}$ is the indicator function, and $X$ is a collection of some random variable. This loss generally emphasizes the contributions of $x$ values larger than $m_\tau$ and downweights those small ones. To ease the stochasticity from the environment (identical to IQL), we introduce the value function and approximate the expectile with respect to the distribution of the next state, i.e.,

$$\mathcal{L}_\psi = \mathbb{E}_{s, s' \sim \mathcal{D}} \left[ L_2^\tau \left( Q_{\theta'}(s, s') - V_\psi(s) \right) \right], \tag{4}$$

where the state $Q$-function is parameterized by $\theta$ with a target network parameterized by $\theta'$, and the value function is parameterized by $\psi$. Then, the state $Q$-function is updated with the MSE loss:

$$\mathcal{L}_\theta = \mathbb{E}_{s, s' \sim \mathcal{D}} \left[ \left( r(s, s') + \gamma V_\psi(s') - Q_\theta(s, s') \right)^2 \right]. \tag{5}$$

Note that in Equations (4) and (5), we only use the state and next state from the fixed dataset to update the state $Q$-function and value function, leaving out any worry of bootstrapping error.
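
A minimal PyTorch-style sketch of the expectile value loss in Equation (4) and the state $Q$-function loss in Equation (5) is given below; the module names (`q_target`, `q_net`, `v_net`) and the hyperparameter values are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    # L^tau_2(u) = |tau - 1(u < 0)| * u^2, Eq. (3)
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def value_loss(v_net, q_target, s, s_next, tau=0.7):
    # Eq. (4): regress V_psi(s) toward an upper expectile of Q_theta'(s, s')
    with torch.no_grad():
        q = q_target(s, s_next)
    return expectile_loss(q - v_net(s), tau)

def q_loss(q_net, v_net, s, s_next, r, gamma=0.99):
    # Eq. (5): MSE toward r(s, s') + gamma * V_psi(s'), using only dataset states
    with torch.no_grad():
        target = r + gamma * v_net(s_next)
    return F.mse_loss(q_net(s, s_next), target)
```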
Training the forward model. The forward model $F_\phi(s, a)$, parameterized by $\phi$, receives the state and action as input and predicts the next state (no reward signal is predicted). A forward model is required because we want to ensure that the state proposed by our method is reachable in one step. To be specific, if we merely train a single model $f(s)$ that predicts the next state based on the current state alone, there is every possibility that the proposed state is unreachable, inaccurate, or even invalid. However, if we train a forward model to predict the possible next state and encode that information in the prediction model, it can enhance the reliability of the predicted state. The forward model is trained by minimizing:

$$\mathcal{L}_\phi = \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \| F_\phi(s, a) - s' \|_2^2 \right]. \tag{6}$$
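
Equation (6) is plain supervised regression; a one-function sketch (assuming `forward_model` is a network mapping a state-action pair to a predicted next state) could look like:

```python
import torch.nn.functional as F

def forward_model_loss(forward_model, s, a, s_next):
    # Eq. (6): ||F_phi(s, a) - s'||_2^2 averaged over the batch
    return F.mse_loss(forward_model(s, a), s_next)
```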
Training the inverse dynamics model. We also need the inverse dynamics model $I_\omega(s, s')$, parameterized by $\omega$, to help us identify how the agent can reach the next state $s'$ starting from the current state $s$. The inverse dynamics model is trained by weighted imitation learning, which is similar in spirit to advantage weighted regression (AWR) [40, 24, 50, 39, 37]:

$$\mathcal{L}_\omega = \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \exp\left( \beta A(s, s') \right) \| I_\omega(s, s') - a \|_2^2 \right], \tag{7}$$
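
The weighted imitation objective in Equation (7) can be sketched as follows; the exponential-weight clipping and the value of `beta` are common AWR-style choices assumed here for numerical stability, not prescriptions from the paper:

```python
import torch

def inverse_dynamics_loss(inv_model, q_target, v_net, s, a, s_next,
                          beta=3.0, max_weight=100.0):
    # Eq. (7): exp(beta * A(s, s'))-weighted behavior cloning of the dataset action
    with torch.no_grad():
        adv = (q_target(s, s_next) - v_net(s)).squeeze(-1)   # A(s, s'), shape (batch,)
        weight = torch.exp(beta * adv).clamp(max=max_weight)  # clipped for stability
    bc_error = (inv_model(s, s_next) - a).pow(2).sum(dim=-1)  # squared L2 per sample
    return (weight * bc_error).mean()
```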