
quantify the uncertainty of the Q-value with neural network ensembles [12], where consistent Q-value estimates indicate high confidence and can plausibly be used during the learning process, even for OOD state-action pairs [13, 14]. However, uncertainty quantification over the OOD region relies heavily on how the neural network generalizes [15]. As prior knowledge of the Q-function is hard to acquire and encode into the neural network, the generalization is unlikely to be reliable enough to facilitate meaningful uncertainty quantification [16]. Notably, all these works are model-free.
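For concreteness, below is a minimal sketch of such ensemble-based uncertainty quantification, where the disagreement (standard deviation) across ensemble members is taken as the uncertainty measure; the ensemble size, network architecture, and this particular disagreement measure are illustrative assumptions rather than the exact design of the cited works.

```python
import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    """Ensemble of independent Q-networks; the disagreement across members
    serves as an (illustrative) uncertainty estimate for a state-action pair."""

    def __init__(self, state_dim, action_dim, num_members=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(num_members)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        qs = torch.stack([m(x).squeeze(-1) for m in self.members], dim=0)
        # Consistent estimates (low std) indicate high confidence, which can be
        # exploited during learning even for OOD state-action pairs.
        return qs.mean(dim=0), qs.std(dim=0)
```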
Model-based offline RL optimizes the policy based on a constructed dynamics model. Compared to the model-free approaches, one prominent advantage is that prior knowledge of the dynamics is easier to access. First, generic priors such as smoothness exist widely across domains [17]. Second, sufficiently learned dynamics models for relevant tasks can act as a data-driven prior for the task at hand [18–20]. With richer prior knowledge, the uncertainty quantification for the dynamics is more trustworthy. As in the model-free approach, the dynamics uncertainty can be incorporated to find a reliable policy beyond the data coverage. However, an additional challenge is how to characterize the accumulative impact of the dynamics uncertainty on the long-term reward, as the system dynamics has an entirely different meaning from the reward or Q-value.
Although the existing model-based offline RL literature theoretically bounds the impact of dynamics uncertainty on final performance, the practical variants characterize this impact through a reward penalty [6, 21, 22]. Concretely, the reward function is penalized by the dynamics uncertainty for each state-action pair [21], or the agent is forced into a low-reward absorbing state when the dynamics uncertainty exceeds a certain level [6]. While optimizing the policy in these constructed MDPs stimulates anti-uncertainty behavior, the final policy tends to be over-conservative. For example, even if the transition dynamics for a state-action pair is ambiguous among several possible candidates, these candidates may generate states from which the system evolves similarly (or evolves differently but generates similar rewards). Then, such a state-action pair should not be treated specially.
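For illustration, the sketch below combines the two constructions described above: the reward is penalized by a dynamics-uncertainty estimate, and the rollout is truncated into a low-reward absorbing state once the uncertainty exceeds a threshold. The coefficient `penalty_coef`, the threshold `halt_threshold`, and the absorbing-state reward are hypothetical placeholders, not values prescribed by [6, 21, 22].

```python
def pessimistic_step(reward, uncertainty,
                     penalty_coef=1.0, halt_threshold=None, absorbing_reward=0.0):
    """Construct the pessimistic reward used for policy optimization.

    reward:       model-predicted reward r(s, a)
    uncertainty:  scalar dynamics-uncertainty estimate u(s, a)
    Returns (modified_reward, done_flag).
    """
    if halt_threshold is not None and uncertainty > halt_threshold:
        # Absorbing-state construction of [6]: terminate the rollout with a
        # low reward once the dynamics uncertainty exceeds a certain level.
        return absorbing_reward, True
    # Reward-penalty construction of [21]: penalize the reward by the
    # dynamics uncertainty for each state-action pair.
    return reward - penalty_coef * uncertainty, False
```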
Motivated by the above intuition, we propose pessimism-modulated dynamics belief for model-based offline RL. In contrast to previous approaches, the dynamics uncertainty is not explicitly quantified. To characterize its impact, we maintain a belief distribution over the system dynamics, and the policy is evaluated/optimized through biased sampling from it. The sampling procedure, biased towards pessimism, is derived based on an alternating Markov game (AMG) formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. Besides, the degree of pessimism is monotonically determined by the hyperparameters of the sampling procedure.
The considered AMG formulation can be regarded as a generalization of the robust MDP, which was proposed as a surrogate to optimize the percentile performance in the face of dynamics uncertainty [23, 24]. However, the robust MDP suffers from two significant shortcomings: 1) the percentile criterion is over-conservative since it fixates on a single pessimistic dynamics instance [25, 26]; 2) the robust MDP is constructed based on an uncertainty set, and an improper choice of uncertainty set further aggravates the degree of conservatism [27, 28]. The AMG formulation is free from these shortcomings. To solve the AMG, we devise an iterative regularized policy optimization algorithm with a guarantee of monotonic improvement under certain conditions. To make it practical, we further derive an offline RL algorithm to approximately find the solution, and empirically evaluate it on the offline RL benchmark D4RL. The results show that the proposed approach clearly outperforms the previous state-of-the-art (SoTA) in 9 out of 18 environment-dataset configurations and performs competitively in the rest, without tuning hyperparameters for each task. The proofs of the theorems in this paper are presented in Appendix B.
2 Preliminaries
Markov Decision Process (MDP)
An MDP is depicted by the tuple $(\mathcal{S}, \mathcal{A}, T, r, \rho_0, \gamma)$, where $\mathcal{S}, \mathcal{A}$ are the state and action spaces, $T(s'|s, a)$ is the transition probability, $r(s, a)$ is the reward function, $\rho_0(s)$ is the initial state distribution, and $\gamma$ is the discount factor. The goal of RL is to find the policy $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$ that maximizes the cumulative discounted reward:
$$J(\pi, T) = \mathbb{E}_{\rho_0, T, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \quad (1)$$
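To make the objective concrete, the following sketch estimates $J(\pi, T)$ by Monte Carlo rollouts; the `env` interface (reset/step), the rollout horizon used to truncate the infinite sum, and the episode count are assumptions for illustration.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, num_episodes=100, horizon=1000):
    """Monte Carlo estimate of J(pi, T) = E[sum_t gamma^t r(s_t, a_t)].

    env:    provides reset() -> s_0 ~ rho_0 and step(a) -> (s', r, done)
    policy: maps a state to a sampled action a ~ pi(.|s)
    The infinite-horizon sum is truncated after `horizon` steps.
    """
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done = env.step(action)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```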