Model-Based Offline Reinforcement Learning with
Pessimism-Modulated Dynamics Belief
Kaiyang Guo    Yunfeng Shao    Yanhui Geng
Huawei Noah’s Ark Lab
Abstract
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. While the dynamics model is learned by reusing the static dataset, its generalization ability can promote policy learning if properly utilized. To that end, several works propose to quantify the uncertainty of the predicted dynamics and explicitly apply it to penalize the reward. However, as the dynamics and the reward are intrinsically different factors in the context of an MDP, characterizing the impact of dynamics uncertainty through a reward penalty may incur an unexpected tradeoff between model utilization and risk avoidance. In this work, we instead maintain a belief distribution over the dynamics, and evaluate/optimize the policy through biased sampling from the belief. The sampling procedure, biased towards pessimism, is derived based on an alternating Markov game formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. To improve the policy, we devise an iterative regularized policy optimization algorithm for the game, with a guarantee of monotonic improvement under a certain condition. To make it practical, we further devise an offline RL algorithm to approximately find the solution. Empirical results show that the proposed approach achieves state-of-the-art performance on a wide range of benchmark tasks.
1 Introduction
In the typical paradigm of RL, the agent actively interacts with the environment and receives feedback to promote policy improvement. This essential trial-and-error procedure can be costly, unsafe, or even prohibitive in practice (e.g., robotics [1], autonomous driving [2], and healthcare [3]), thus constituting a major impediment to the actual deployment of RL. Meanwhile, for a number of applications, historical data records are available that reflect the system feedback under a predefined policy. This raises the opportunity to learn a policy in a purely offline setting.
In the offline setting, as no further interaction with the environment is permitted, the dataset provides only limited coverage of the state-action space. A policy that induces out-of-distribution (OOD) state-action pairs therefore cannot be well evaluated in the offline learning phase, and deploying it online can yield poor performance. Recent studies have reported that applying vanilla RL algorithms to offline datasets exacerbates such distributional shift [4–6], making them unsuitable for the offline setting.
To tackle the distributional-shift issue, a number of offline RL approaches have been developed. One category directly constrains the policy to stay close to the data-collecting policy [4, 5, 7, 8], or penalizes the Q-value towards conservatism for OOD state-action pairs [9–11]. While these methods achieve remarkable performance gains, the policy regularizer and the Q-value penalty tightly restrict the produced policy to the data manifold.
Corresponding to: guokaiyang@huawei.com
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Instead, more recent works quantify the uncertainty of the Q-value with neural network ensembles [12], where consistent Q-value estimates indicate high confidence and can plausibly be used during the learning process, even for OOD state-action pairs [13, 14]. However, the uncertainty quantification over the OOD region relies heavily on how the neural network generalizes [15]. As prior knowledge of the Q-function is hard to acquire and embed into the neural network, this generalization is unlikely to be reliable enough to facilitate meaningful uncertainty quantification [16]. Notably, all these works are model-free.
Model-based offline RL optimizes the policy based on a constructed dynamics model. Compared to model-free approaches, one prominent advantage is that prior knowledge of the dynamics is easier to access. First, generic priors such as smoothness exist widely across domains [17]. Second, sufficiently learned dynamics models for related tasks can act as a data-driven prior for the task at hand [18–20]. With richer prior knowledge, the uncertainty quantification for the dynamics is more trustworthy. Similar to the model-free approaches, the dynamics uncertainty can be incorporated to find a reliable policy beyond the data coverage. However, an additional challenge is how to characterize the cumulative impact of dynamics uncertainty on the long-term reward, since the system dynamics has an entirely different meaning from the reward or Q-value.
Although the existing model-based offline RL literature theoretically bounds the impact of dynamics uncertainty on the final performance, the practical variants characterize this impact through reward penalties [6, 21, 22]. Concretely, the reward function is penalized by the dynamics uncertainty for each state-action pair [21], or the agent is forced into a low-reward absorbing state when the dynamics uncertainty exceeds a certain level [6]. While optimizing the policy in these constructed MDPs stimulates uncertainty-averse behavior, the final policy tends to be over-conservative. For example, even if the transition dynamics for a state-action pair is ambiguous among several possible candidates, these candidates may generate states from which the system evolves similarly.² Then, such a state-action pair should not be treated specially.

² Or, from these states, the system evolves differently but generates similar rewards.
Motivated by the above intuition, we propose pessimism-modulated dynamics belief for model-based offline RL. In contrast with previous approaches, the dynamics uncertainty is not explicitly quantified. To characterize its impact, we maintain a belief distribution over the system dynamics, and the policy is evaluated/optimized through biased sampling from it. The sampling procedure, biased towards pessimism, is derived based on an alternating Markov game (AMG) formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. Moreover, the degree of pessimism is monotonically controlled by the hyperparameters of the sampling procedure.
The considered AMG formulation can be regarded as a generalization of the robust MDP, which was proposed as a surrogate to optimize the percentile performance in the face of dynamics uncertainty [23, 24]. However, the robust MDP suffers from two significant shortcomings: 1) the percentile criterion is over-conservative, since it fixates on a single pessimistic dynamics instance [25, 26]; 2) the robust MDP is constructed from an uncertainty set, and an improper choice of uncertainty set further aggravates the degree of conservatism [27, 28]. The AMG formulation is free from these shortcomings. To solve the AMG, we devise an iterative regularized policy optimization algorithm with a guarantee of monotonic improvement under a certain condition. To make it practical, we further derive an offline RL algorithm to approximately find the solution, and empirically evaluate it on the offline RL benchmark D4RL. The results show that the proposed approach clearly outperforms the previous state of the art (SoTA) in 9 out of 18 environment-dataset configurations and performs competitively on the rest, without tuning hyperparameters per task. The proofs of the theorems in this paper are presented in Appendix B.
2 Preliminaries
Markov Decision Process (MDP)   An MDP is depicted by the tuple (S, A, T, r, ρ0, γ), where S, A are the state and action spaces, T(s′|s, a) is the transition probability, r(s, a) is the reward function, ρ0(s) is the initial state distribution, and γ is the discount factor. The goal of RL is to find the policy π : S → ∆(A) that maximizes the cumulative discounted reward:

$$J(\pi, T) = \mathbb{E}_{\rho_0, T}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad (1)$$

where ∆(·) denotes the probability simplex. In the typical RL paradigm, this is done by actively interacting with the environment.
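For concreteness, the objective in (1) can be estimated by Monte Carlo rollouts whenever a simulator of T is available. The sketch below is only illustrative; the `env.reset()`/`env.step()` interface, the `policy` callable, and the truncation horizon are assumptions, not part of the paper.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, n_rollouts=100, horizon=1000):
    """Monte Carlo estimate of J(pi, T) = E[sum_t gamma^t r(s_t, a_t)].

    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done),
    and `policy(s)` returns an action; both are hypothetical interfaces.
    """
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env.step(a)
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```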
Offline RL   In the offline setting, the environment is inaccessible, and only a static dataset D = {(s, a, r, s′)} is provided, containing data samples previously logged under an unknown behavior policy. Offline RL aims to optimize the policy by leveraging only the offline dataset.

To simplify the presentation, we assume the reward function r and the initial state distribution ρ0 are known. Then, the system dynamics is unknown only in terms of the transition probability T. Note that the considered formulation and the proposed approach can be easily extended to the general case without additional technical modification.
Robust MDP   With the offline dataset, a straightforward strategy is to first learn a dynamics model τ(s′|s, a) and then optimize the policy via simulation. However, due to the limited amount of available data, the learned model is inevitably imprecise. The robust MDP [23] is a surrogate for optimizing the policy under ambiguity about the dynamics. Concretely, a robust MDP is constructed by introducing an uncertainty set T = {τ} that contains plausible transition probabilities. If the uncertainty set includes the true transition with probability (1 − δ), the performance of any policy π in the true MDP can be lower bounded by min_{τ∈T} J(π, τ) with probability at least (1 − δ). Thus, the percentile performance for the true MDP can be optimized by finding a solution to

$$\max_{\pi} \min_{\tau \in \mathcal{T}} J(\pi, \tau). \qquad (2)$$
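As a rough sketch of (2), when the uncertainty set is approximated by a finite collection of candidate dynamics models, the inner minimization reduces to taking the worst-case return over that collection. The `evaluate_return` routine is a hypothetical stand-in for an estimator of J(π, τ) (e.g., Monte Carlo rollouts in model τ), and the finite approximation itself is an assumption for illustration.

```python
def robust_objective(policy, uncertainty_set, evaluate_return):
    """Worst-case return min_{tau in T} J(pi, tau) over a finite uncertainty set.

    `uncertainty_set` is a list of candidate dynamics models and
    `evaluate_return(policy, tau)` estimates J(pi, tau); both are assumed
    interfaces used only for illustration.
    """
    return min(evaluate_return(policy, tau) for tau in uncertainty_set)

# Optimization under (2) would then maximize this worst case, e.g. over a set
# of candidate policies:
# best = max(policy_candidates, key=lambda pi: robust_objective(pi, T_set, evaluate_return))
```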
Despite its popularity, the robust MDP suffers from two major shortcomings. First, the percentile criterion overly fixates on a single pessimistic transition instance, especially when there are multiple optimal policies for this transition but they lead to dramatically different performance under other transitions [25, 26]. This behavior results in an unnecessarily conservative policy.
Second, the level of conservatism can be further aggravated when the uncertainty set is inappropriately constructed [27]. For a given policy π, the ideal situation is that T contains the (1 − δ) proportion of transitions with which the policy achieves higher performance than with the other δ proportion. Then, min_{τ∈T} J(π, τ) is exactly the δ-quantile performance. This requires the uncertainty set to be policy-dependent, and during policy optimization the uncertainty set should change accordingly. Otherwise, if T is predetermined and fixed, it is possible to have a transition τ′ ∉ T with non-zero probability that satisfies J(π, τ′) > min_{τ∈T} J(π, τ), where π is the optimal policy for (2). Then, adding τ′ to T does not affect the optimal solution of problem (2). This indicates that we are essentially optimizing a δ′-quantile performance, where δ′ can be much smaller than δ. In the literature, the uncertainty sets are mostly predetermined before policy optimization [23, 29–31].
3 Pessimism-Modulated Dynamics Belief
In short, the robust MDP is over-conservative due to its fixation on a single pessimistic transition instance and the predetermination of the uncertainty set. In this work, we strive to take the entire spectrum of plausible transitions into account, and let the algorithm itself determine which part deserves more attention. To this end, we consider an alternating Markov game formulation of offline RL, based on which the proposed offline RL approach is derived.
3.1 Formulation
Alternating Markov game (AMG)   The AMG is a specialization of the two-player zero-sum game, depicted by (S, S̄, A, Ā, G, r, ρ0, γ). The game starts from a state sampled from ρ0; then the two players alternately choose actions a ∈ A and ā ∈ Ā under states s ∈ S and s̄ ∈ S̄, along with the game transition defined by G(s̄|s, a) and G(s|s̄, ā). At each round, the primary player receives reward r(s, a) and the secondary player receives its negative counterpart −r(s, a).
Offline RL as AMG   We formulate the offline RL problem as an AMG, where the primary player optimizes a reliable policy for the MDP of interest in the face of stochastic disturbance from the secondary player. The AMG is constructed by augmenting the original MDP. As both have transition probabilities, we use game transition and system transition to differentiate them.
For the primary player, its state space S, action space A and reward function r(s, a) are the same as those in the original MDP. After the primary player acts, the game emits an N-sized set of system transition candidates T_sa, which then acts as the state of the secondary player. Formally, T_sa is generated according to

$$G(\bar{s} = \mathcal{T}_{sa} \mid s, a) = \prod_{\tau_{sa} \in \mathcal{T}_{sa}} P_T^{sa}(\tau_{sa}), \qquad (3)$$
where τ_sa(·) re-denotes the plausible system transition τ(·|s, a) for short, and P_T^{sa} is a given belief distribution over τ_sa. According to (3), the elements in T_sa are independent and identically distributed samples from P_T^{sa}. The major difference from the uncertainty set in the robust MDP is that the set introduced here is not fixed but stochastic at each step. To distinguish it from the uncertainty set, we call it the candidate set. The belief distribution P_T^{sa} can be chosen arbitrarily to incorporate knowledge of the system transition. In particular, when the prior distribution of the system transition is accessible, P_T^{sa} can be obtained as the posterior by integrating the prior and the evidence D through Bayes' rule.
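In practice, the belief P_T^{sa} is often representable with finite support, e.g., as weights over an ensemble of learned dynamics models. Under that assumption (made here only for illustration, not prescribed by the paper), a candidate set of size N can be drawn i.i.d. per step as in (3):

```python
import numpy as np

def sample_candidate_set(ensemble, belief_weights, n_candidates):
    """Draw an N-sized candidate set T_sa i.i.d. from the dynamics belief.

    `ensemble` is a list of transition models tau(.|s, a) and `belief_weights`
    are the belief probabilities P_T^{sa} over the members; this finite-support
    representation is an assumption used for illustration.
    """
    idx = np.random.choice(len(ensemble), size=n_candidates,
                           p=belief_weights, replace=True)
    return [ensemble[i] for i in idx]
```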
The secondary player receives the candidate set T_sa as its state. Thus, its state space can be denoted by S̄ = ∆^N(S), i.e., the N-fold Cartesian product of the probability simplex over S. Note that the state T_sa also takes the role of the action space, i.e., Ā = T_sa, meaning that the action of the secondary player is to choose a system transition from the candidate set. Given the chosen τ_sa ∈ T_sa, the game evolves by sampling from τ_sa, i.e.,

$$G(s' \mid \bar{s} = \mathcal{T}_{sa}, \bar{a} = \tau_{sa}) = \tau_{sa}(s'), \qquad (4)$$

and the primary player receives s′ to continue the game. In the following, we use P_T^N(T_sa) to compactly denote the game transition G(s̄ = T_sa | s, a) in (3), and omit the superscript sa in τ_sa, T_sa and P_T^{sa} when it is clear from the context.
For the above AMG, we consider a specific policy (explained below) for the secondary player, such that the cumulative discounted reward of the primary player with policy π can be written as:

$$J(\pi) := \mathbb{E}_{\rho_0, P_T^N}\, \lfloor\min\rfloor^{k}_{\tau_0 \in \mathcal{T}_0} \left[ \mathbb{E}_{\tau_0, P_T^N}\, \lfloor\min\rfloor^{k}_{\tau_1 \in \mathcal{T}_1} \left[ \cdots \left[ \mathbb{E}_{\tau} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right] \right] \right] \right], \qquad (5)$$

where the subscripts of τ and T denote the time step, the expectation is taken over s0 ∼ ρ0, s_{t>0} ∼ τ_{t−1}(·|s_{t−1}, a_{t−1}), a_t ∼ π(·|s_t) and T_t ∼ P_T^N, and the operator ⌊min⌋^k_{x∈X} f(x) denotes finding the k-th minimum of f(x) over x ∈ X. The policy of the secondary player is implicitly defined by this operator. By varying k ∈ {1, 2, · · · , N}, the secondary player exhibits various degrees of adversarial or aggressive disturbance to the future reward. From the view of the original MDP, this behavior yields a flexible tendency ranging from pessimism to optimism when evaluating the policy π.
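The operator ⌊min⌋^k simply selects the k-th smallest value among the candidates; a minimal numpy sketch is given below.

```python
import numpy as np

def kth_min(values, k):
    """Return the k-th minimum (1-indexed) of a finite collection, i.e. the
    value selected by the operator floor-min^k over a candidate set."""
    values = np.asarray(values)
    assert 1 <= k <= values.size
    return np.partition(values, k - 1)[k - 1]

# k = 1 gives the most pessimistic (worst-case) choice, k = N the most optimistic.
```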
The distinctions between the introduced AMG and the robust MDP are twofold: 1) Given a belief distribution over transitions, the robust MDP selects only part of its support into the uncertainty set, and the set elements are treated indiscriminately. This indicates that both the possibility of transitions outside the uncertainty set and the relative likelihood of transitions within the uncertainty set are discarded. In the AMG, however, the candidate set simply contains samples drawn from the belief distribution, implying no information loss in an average sense. Intuitively, by keeping richer knowledge of the system, the performance evaluation is more exact and kept away from excessive conservatism. 2) In the robust MDP, the level of conservatism is expected to be controlled by its hyperparameter δ. However, as illustrated in Section 2, a smaller δ does not necessarily correspond to a more conservative performance evaluation, due to the extra impact of the uncertainty set construction. In contrast, for the AMG, the degree of conservatism is adjusted by the size of the candidate set N and the order of the minimum k. When the value of k or N changes, the resulting impact on the performance evaluation is predictable, as formalized in Theorem 3.
To evaluate J(π), we define the following Bellman backup operator:

$$\mathcal{B}^{\pi}_{N,k} Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{P_T^N}\Big[ \lfloor\min\rfloor^{k}_{\tau \in \mathcal{T}}\, \mathbb{E}_{\tau}\big[ Q(s', a') \big] \Big]. \qquad (6)$$
As the operator depends on N and k, and we emphasize pessimism in offline RL, we call it the (N, k)-pessimistic Bellman backup operator. Compared to the standard Bellman backup operator in Q-learning, B^π_{N,k} additionally includes the expectation over T ∼ P_T^N and the k-minimum operator over T. Despite these differences, we prove that B^π_{N,k} is still a contraction mapping, based on which J(π) can be easily evaluated.
Theorem 1 (Policy Evaluation). The (N, k)-pessimistic Bellman backup operator B^π_{N,k} is a contraction mapping. By starting from any function Q : S × A → R and repeatedly applying B^π_{N,k}, the sequence converges to Q^π_{N,k}, with which we have J(π) = E_{ρ0}[Q^π_{N,k}(s0, a0)].
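To make the backup in (6) concrete, the following sketch computes the (N, k)-pessimistic target for a single (s, a) pair in a finite-state setting, with the candidate set already drawn; averaging over freshly drawn candidate sets would estimate the outer expectation over P_T^N. The tabular Q, the policy matrix, and the representation of each candidate as a transition vector are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def pessimistic_backup(Q, policy, r_sa, candidate_set, gamma, k):
    """One application of the (N, k)-pessimistic Bellman backup for one (s, a),
    given an already sampled candidate set of size N.

    Q             : array [S, A], current Q-estimate
    policy        : array [S, A], pi(a|s)
    r_sa          : scalar reward r(s, a)
    candidate_set : list of N transition vectors tau(.|s, a), each of shape [S]
    """
    # Expected next value E_tau[Q(s', a')] under each candidate transition.
    v_next = (Q * policy).sum(axis=1)                 # V^pi(s') for all s'
    values = np.array([tau @ v_next for tau in candidate_set])
    # k-th minimum over the candidate set (1-indexed), then the discounted backup.
    kth = np.partition(values, k - 1)[k - 1]
    return r_sa + gamma * kth
```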
3.2 Pessimism-Modulated Dynamics Belief
With the converged Q-value, we are ready to establish a more direct connection between the AMG and the original MDP. The connection appears as the answer to a natural question: the calculation of (6) encompasses biased samples from the dynamics belief distribution; can we treat these samples as unbiased ones drawn from another belief distribution? We give a positive answer in the following theorem.
Theorem 2 (Equivalent MDP with Pessimism-Modulated Dynamics Belief). The alternating Markov game in (5) is equivalent to the MDP with tuple (S, A, T̃, r, ρ0, γ), where the transition probability T̃(s′|s, a) = E_{P̃_T^{sa}}[τ_sa(s′)] is defined with the reweighted belief distribution P̃_T^{sa}:

$$\tilde{P}_T^{sa}(\tau_{sa}) \propto w\Big( \mathbb{E}_{\tau_{sa}}\big[ Q^{\pi}_{N,k}(s', a') \big];\, k, N \Big)\, P_T^{sa}(\tau_{sa}), \qquad (7)$$

$$w(x; k, N) = F(x)^{k-1} \big(1 - F(x)\big)^{N-k}, \qquad (8)$$

and F(·) is the cumulative distribution function. Furthermore, the value of w(x; k, N) first increases and then decreases with x, and its maximum is attained at the (k−1)/(N−1) quantile, i.e., at x = F^{-1}((k−1)/(N−1)).
On the right-hand side of (7), τ_sa itself is random, following the belief distribution; thus E_{τ_sa}[Q^π_{N,k}(s′, a′)], as a functional of τ_sa, is also a random variable, whose cumulative distribution function F is determined by the belief distribution P_T^{sa}. Intuitively, we can treat E_{τ_sa}[Q^π_{N,k}(s′, a′)] as a pessimism indicator for the transition τ_sa, with a larger value indicating less pessimism.
From Theorem 2, the maximum of w is attained at the transition τ with F(E_τ[Q^π_{N,k}(s′, a′)]) = (k−1)/(N−1), i.e., the transition with the (k−1)/(N−1)-quantile pessimism indicator. Besides, as E_{τ_sa}[Q^π_{N,k}(s′, a′)] departs from the (k−1)/(N−1) quantile, the reweighting coefficient for its τ_sa decreases. Considering the effect of w on P̃_T^{sa} and the equivalence between the AMG and the refined MDP, we can say that J(π) is a soft percentile performance. Compared to the standard percentile criterion, J(π) is derived by reshaping the belief distribution towards concentrating around a certain percentile, rather than fixating on a single percentile point. Due to this feature, we term P̃_T^{sa} the Pessimism-Modulated Dynamics Belief (PMDB).
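For a finite-support belief (again an assumption made only for illustration), the modulation in (7)-(8) can be computed directly: the empirical CDF of the pessimism indicators under the prior gives F, which yields w, and normalization gives the pessimism-modulated belief.

```python
import numpy as np

def pessimism_modulated_belief(indicators, prior_weights, k, N):
    """Reweight a finite-support dynamics belief as in Theorem 2.

    indicators    : array of E_tau[Q(s', a')] for each support point tau
    prior_weights : prior belief P_T^{sa} over the same support points
    Returns the normalized modulated belief ~P_T^{sa}.
    """
    indicators = np.asarray(indicators, dtype=float)
    prior = np.asarray(prior_weights, dtype=float)
    # Empirical CDF of the pessimism indicator under the prior belief; the
    # midpoint convention at ties keeps F strictly inside (0, 1).
    F = np.array([prior[indicators < x].sum() + 0.5 * prior[indicators == x].sum()
                  for x in indicators])
    # w(x; k, N) = F(x)^{k-1} (1 - F(x))^{N-k}, peaked near the (k-1)/(N-1) quantile.
    w = F ** (k - 1) * (1.0 - F) ** (N - k)
    post = w * prior
    return post / post.sum()
```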
Lastly, recalling that all the above derivations involve the hyperparameters k and N, we present the monotonicity of Q^π_{N,k} with respect to them in Theorem 3. Furthermore, by combining Theorem 1 with Theorem 3, we conclude that J(π) decreases with N and increases with k.

Theorem 3 (Monotonicity). The converged Q-function Q^π_{N,k} has the following properties:

• Given any k, the Q-function Q^π_{N,k} element-wise decreases with N ∈ {k, k + 1, · · · }.
• Given any N, the Q-function Q^π_{N,k} element-wise increases with k ∈ {1, 2, · · · , N}.
• The Q-function Q^π_{N,N} element-wise increases with N.
[Figure 1: Monotonicity of Q-values. The arrows indicate the directions along which Q-values increase. The figure plots N against k ∈ {1, 2, . . . , 8}, with regions labeled "Robust MDP", "MBRL", and "N = k".]
Remark 1 (Special Cases). For N = k = 1, we have P̃_T^{sa} = P_T^{sa}. Then, the performance is evaluated through sampling from the initial belief distribution. This resembles the common methodology in model-based RL (MBRL), with the dynamics belief defined by the uniform distribution over an ensemble of dynamics models. For k = δ(N − 1) + 1 and N → ∞, P̃_T^{sa} asymptotically collapses to a delta function. Then, J(π) degenerates to fixating on a single transition instance. It is equivalent to the robust MDP with the uncertainty set constructed as {τ_sa : P_T^{sa}(τ_sa) > 0, E_{τ_sa}[Q^π_{N,k}(s′, a′)] ≥ F^{-1}(δ)}. In this sense, the AMG is a successive interpolation between MBRL and the robust MDP.
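To see the two limiting cases of Remark 1 numerically, the toy snippet below applies the k-th-minimum selection to samples of a one-step pessimism indicator (a standard normal chosen purely for illustration): with N = k = 1 the selected value is an unbiased draw from the belief (MBRL-style evaluation), while with large N and k = δ(N − 1) + 1 it concentrates near the δ-quantile (robust-MDP-style evaluation).

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.1

def indicator(size):
    """Toy pessimism indicators drawn from a standard-normal belief."""
    return rng.normal(0.0, 1.0, size)

# N = k = 1: a single draw from the belief, i.e. plain model-based evaluation.
mbrl_like = np.mean([indicator(1)[0] for _ in range(5000)])

# Large N, k = ceil(delta * (N - 1)) + 1: the k-th minimum approximates the delta-quantile.
N = 1000
k = int(np.ceil(delta * (N - 1))) + 1
robust_like = np.mean([np.sort(indicator(N))[k - 1] for _ in range(200)])

print(mbrl_like)    # close to the belief mean (0.0)
print(robust_like)  # close to the 10% quantile of N(0, 1), about -1.28
```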