
quantify the uncertainty of the Q-value with neural network ensembles [12], where consistent Q-value estimates indicate high confidence and can plausibly be used during the learning process, even for OOD state-action pairs [13, 14]. However, uncertainty quantification over the OOD region relies heavily on how the neural network generalizes [15]. As prior knowledge of the Q-function is hard to acquire and encode into the neural network, the generalization is unlikely to be reliable enough to facilitate meaningful uncertainty quantification [16]. Notably, all these works are model-free.
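For concreteness, below is a minimal sketch of such ensemble-based uncertainty quantification, where the disagreement (standard deviation) across ensemble members is taken as the uncertainty measure; the ensemble size, network architecture, and this particular disagreement measure are illustrative assumptions rather than the exact design of the cited works.

```python
import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    """Ensemble of independent Q-networks; the disagreement across members
    serves as an (illustrative) uncertainty estimate for a state-action pair."""

    def __init__(self, state_dim, action_dim, num_members=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(num_members)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        qs = torch.stack([m(x).squeeze(-1) for m in self.members], dim=0)
        # Consistent estimates (low std) indicate high confidence, which can be
        # exploited during learning even for OOD state-action pairs.
        return qs.mean(dim=0), qs.std(dim=0)
```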
Model-based offline RL optimizes the policy based on a constructed dynamics model. Compared to the model-free approaches, one prominent advantage is that prior knowledge of the dynamics is easier to access. First, generic priors such as smoothness exist widely across domains [17]. Second, sufficiently learned dynamics models for relevant tasks can act as a data-driven prior for the task at hand [18–20]. With richer prior knowledge, the uncertainty quantification for the dynamics is more trustworthy. As in the model-free approach, the dynamics uncertainty can be incorporated to find a reliable policy beyond the data coverage. However, an additional challenge is how to characterize the accumulative impact of the dynamics uncertainty on the long-term reward, as the system dynamics has an entirely different meaning from the reward or Q-value.
Although the existing model-based offline RL literature theoretically bounds the impact of dynamics uncertainty on final performance, the practical variants characterize this impact through a reward penalty [6, 21, 22]. Concretely, the reward function is penalized by the dynamics uncertainty for each state-action pair [21], or the agent is forced into a low-reward absorbing state when the dynamics uncertainty exceeds a certain level [6]. While optimizing the policy in these constructed MDPs stimulates anti-uncertainty behavior, the final policy tends to be over-conservative. For example, even if the transition dynamics for a state-action pair is ambiguous among several possible candidates, these candidates may generate states from which the system evolves similarly (or evolves differently but generates similar rewards). Then, such a state-action pair should not be treated specially.
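For illustration, the sketch below combines the two constructions described above: the reward is penalized by a dynamics-uncertainty estimate, and the rollout is truncated into a low-reward absorbing state once the uncertainty exceeds a threshold. The coefficient `penalty_coef`, the threshold `halt_threshold`, and the absorbing-state reward are hypothetical placeholders, not values prescribed by [6, 21, 22].

```python
def pessimistic_step(reward, uncertainty,
                     penalty_coef=1.0, halt_threshold=None, absorbing_reward=0.0):
    """Construct the pessimistic reward used for policy optimization.

    reward:       model-predicted reward r(s, a)
    uncertainty:  scalar dynamics-uncertainty estimate u(s, a)
    Returns (modified_reward, done_flag).
    """
    if halt_threshold is not None and uncertainty > halt_threshold:
        # Absorbing-state construction of [6]: terminate the rollout with a
        # low reward once the dynamics uncertainty exceeds a certain level.
        return absorbing_reward, True
    # Reward-penalty construction of [21]: penalize the reward by the
    # dynamics uncertainty for each state-action pair.
    return reward - penalty_coef * uncertainty, False
```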
Motivated by the above intuition, we propose pessimism-modulated dynamics belief for model-based offline RL. In contrast to previous approaches, the dynamics uncertainty is not explicitly quantified. To characterize its impact, we maintain a belief distribution over the system dynamics, and the policy is evaluated/optimized through biased sampling from it. The sampling procedure, biased towards pessimism, is derived based on an alternating Markov game (AMG) formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. Besides, the degree of pessimism is monotonically determined by the hyperparameters of the sampling procedure.
The considered AMG formulation can be regarded as a generalization of the robust MDP, which was proposed as a surrogate to optimize the percentile performance in the face of dynamics uncertainty [23, 24]. However, the robust MDP suffers from two significant shortcomings: 1) the percentile criterion is over-conservative since it fixates on a single pessimistic dynamics instance [25, 26]; 2) the robust MDP is constructed based on an uncertainty set, and an improper choice of uncertainty set further aggravates the degree of conservatism [27, 28]. The AMG formulation is free from these shortcomings. To solve the AMG, we devise an iterative regularized policy optimization algorithm with a guarantee of monotonic improvement under certain conditions. To make it practical, we further derive an offline RL algorithm to approximately find the solution, and empirically evaluate it on the offline RL benchmark D4RL. The results show that the proposed approach clearly outperforms the previous state-of-the-art (SoTA) in 9 out of 18 environment-dataset configurations and performs competitively in the rest, without tuning hyperparameters for each task. The proofs of the theorems in this paper are presented in Appendix B.
2 Preliminaries
Markov Decision Process (MDP)
An MDP is depicted by the tuple $(\mathcal{S}, \mathcal{A}, T, r, \rho_0, \gamma)$, where $\mathcal{S}, \mathcal{A}$ are the state and action spaces, $T(s'|s, a)$ is the transition probability, $r(s, a)$ is the reward function, $\rho_0(s)$ is the initial state distribution, and $\gamma$ is the discount factor. The goal of RL is to find the policy $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$ that maximizes the cumulative discounted reward:
$$J(\pi, T) = \mathbb{E}_{\rho_0, T, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \quad (1)$$
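To make the objective concrete, the following sketch estimates $J(\pi, T)$ by Monte Carlo rollouts; the `env` interface (reset/step), the rollout horizon used to truncate the infinite sum, and the episode count are assumptions for illustration.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, num_episodes=100, horizon=1000):
    """Monte Carlo estimate of J(pi, T) = E[sum_t gamma^t r(s_t, a_t)].

    env:    provides reset() -> s_0 ~ rho_0 and step(a) -> (s', r, done)
    policy: maps a state to a sampled action a ~ pi(.|s)
    The infinite-horizon sum is truncated after `horizon` steps.
    """
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done = env.step(action)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```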