works [23, 24]. In offline MBRL, this issue is exacerbated, since the learned model can hardly be globally accurate due to the limited amount of offline data and the complexity of the control tasks.
Motivated by the objective-mismatch issue, we develop an iterative offline MBRL method, alternating
between training the dynamic model and the policy to maximize a lower bound of the true expected
return. This lower bound, leading to a weighted MLE objective for the dynamic-model training, is
relaxed to a tractable regularized objective for the policy learning. To train the dynamic model by
the proposed objective, we need to estimate the marginal importance weights (MIW) between the
offline-data distribution and the stationary state-action distribution of the current policy [27, 28]. This
estimation tends to be unstable with standard approaches [e.g., 29, 30], which require saddle-point
optimization. Instead, we propose a simple yet stable fixed-point-style method for MIW estimation,
which can be directly incorporated into our alternating training framework. With these considerations,
our method, offline Alternating Model-Policy Learning (AMPL), performs competitively on a wide
range of continuous-control offline RL datasets in the D4RL benchmark [31]. These empirical results and ablation studies show the efficacy of our proposed algorithmic designs.
2 Background
Markov decision process and offline RL.
A Markov decision process (MDP) is denoted by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s' \mid s, a): \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ the environmental dynamic, $r(s, a): \mathcal{S} \times \mathcal{A} \to [-r_{\max}, r_{\max}]$ the reward function, $\gamma \in [0, 1)$ the discount factor, and $\mu_0(s): \mathcal{S} \to [0, 1]$ the initial state distribution.
For any policy $\pi(a \mid s)$, we denote its state-action distribution at timestep $t \ge 0$ as $d^{P}_{\pi,t}(s, a) \triangleq \Pr(s_t = s, a_t = a \mid s_0 \sim \mu_0, a_t \sim \pi, s_{t+1} \sim P, \forall t \ge 0)$. The (discounted) stationary state-action distribution of $\pi$ is denoted as $d^{P}_{\pi,\gamma}(s, a) \triangleq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} d^{P}_{\pi,t}(s, a)$.
Denote $Q^{P}_{\pi}(s, a) = \mathbb{E}_{\pi, P}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$ as the action-value function of policy $\pi$ under the dynamic $P$. The goal of RL is to find a policy $\pi$ maximizing the expected return
$$J(\pi, P) \triangleq (1 - \gamma)\, \mathbb{E}_{s \sim \mu_0,\, a \sim \pi(\cdot \mid s)}\left[Q^{P}_{\pi}(s, a)\right] = \mathbb{E}_{(s, a) \sim d^{P}_{\pi,\gamma}}\left[r(s, a)\right]. \quad (1)$$
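To make Eq. (1) concrete, the following minimal Python sketch estimates $J(\pi, P)$ by Monte Carlo, averaging $(1-\gamma)$-scaled discounted returns over rollouts; the `env` and `policy` objects are hypothetical placeholders with a Gym-style interface, not part of our method.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, num_episodes=100, horizon=1000):
    """Monte Carlo estimate of J(pi, P) = (1 - gamma) * E[ sum_t gamma^t r_t ].

    `env` is assumed to expose a Gym-style reset()/step() interface and
    `policy(s)` to return an action; both are hypothetical placeholders.
    """
    returns = []
    for _ in range(num_episodes):
        s, disc_return, discount = env.reset(), 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done, _ = env.step(a)
            disc_return += discount * r
            discount *= gamma
            if done:
                break
        returns.append(disc_return)
    # The (1 - gamma) factor matches the normalization in Eq. (1).
    return (1.0 - gamma) * np.mean(returns)
```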
In offline RL, the policy $\pi$ and critic $Q^{P}_{\pi}$ are typically approximated by parametric functions $\pi_\phi$ and $Q_\theta$, respectively, with parameters $\phi$ and $\theta$. The critic $Q_\theta$ is trained by the Bellman backup
$$\arg\min_{\theta} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}_{\mathrm{env}}}\left[\left(Q_\theta(s, a) - \left(r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi_\phi(\cdot \mid s')}\left[Q_{\theta'}(s', a')\right]\right)\right)^{2}\right], \quad (2)$$
where $Q_{\theta'}$ is the target network [12, 13]. The actor $\pi_\phi$ is trained in the policy-improvement step by
$$\arg\max_{\phi} \mathbb{E}_{s \sim \mathcal{D}_{\mathrm{env}},\, a \sim \pi_\phi(\cdot \mid s)}\left[Q_\theta(s, a)\right], \quad (3)$$
where $\mathcal{D}_{\mathrm{env}}$ denotes the offline dataset drawn from $d^{P}_{\pi_b,\gamma}$ [2, 32], with $\pi_b$ being the behavior policy.
Offline model-based RL.
In offline model-based RL algorithms, the true environmental dynamic $P^{*}$ is typically approximated by a parametric function $\widehat{P}(s' \mid s, a)$ in some function class $\mathcal{P}$. With the offline dataset $\mathcal{D}_{\mathrm{env}}$, $\widehat{P}$ is trained via the MLE [15, 16, 18] as
$$\arg\max_{\widehat{P} \in \mathcal{P}} \mathbb{E}_{(s, a, s') \sim \mathcal{D}_{\mathrm{env}}}\left[\log \widehat{P}(s' \mid s, a)\right]. \quad (4)$$
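Eq. (4) is commonly instantiated with a neural network outputting the mean and scale of a diagonal Gaussian over the next state. The PyTorch sketch below is one such hypothetical instantiation for illustration; the architecture and hyperparameters are placeholders, not those used in our experiments.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Diagonal-Gaussian model of P_hat(s' | s, a), trained by MLE (Eq. 4)."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden_dim, state_dim)
        self.log_std = nn.Linear(hidden_dim, state_dim)

    def log_prob(self, s, a, s_next):
        # Joint log-density of s' under a diagonal Gaussian.
        h = self.net(torch.cat([s, a], dim=-1))
        std = self.log_std(h).clamp(-10.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std).log_prob(s_next).sum(-1)

def mle_step(model, optimizer, batch):
    """One MLE gradient step on a batch of (s, a, s') sampled from D_env."""
    loss = -model.log_prob(batch["s"], batch["a"], batch["s_next"]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```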
Similarly, the reward function can be approximated by a parametric model $\widehat{r}$ if assumed unknown. With $\widehat{P}$ and $\widehat{r}$, the true MDP $\mathcal{M}$ can be approximated by $\widehat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \widehat{P}, \widehat{r}, \gamma, \mu_0)$. We further define $d^{P^{*}}_{\pi,\gamma}(s, a)$ as the stationary state-action distribution induced by $\pi$ on $P^{*}$ (or MDP $\mathcal{M}$), and $d^{\widehat{P}}_{\pi,\gamma}(s, a)$ as that on the learned dynamic $\widehat{P}$ (or MDP $\widehat{\mathcal{M}}$). We approximate $d^{P^{*}}_{\pi_\phi,\gamma}$ by simulating $\pi_\phi$ on $\widehat{\mathcal{M}}$ for a short horizon $h$ starting from states $s \in \mathcal{D}_{\mathrm{env}}$, as in prior work [e.g., 16, 18, 19, 21]. The resulting transitions are stored in a replay buffer $\mathcal{D}_{\mathrm{model}}$, constructed similarly to that in off-policy RL [33, 9]. To better approximate $d^{P^{*}}_{\pi_\phi,\gamma}$, sampling from $\mathcal{D}_{\mathrm{env}}$ in Eqs. (2) and (3) is commonly replaced by sampling from the augmented dataset $\mathcal{D} = f\, \mathcal{D}_{\mathrm{env}} + (1 - f)\, \mathcal{D}_{\mathrm{model}},\ f \in [0, 1]$, which denotes sampling from $\mathcal{D}_{\mathrm{env}}$ and $\mathcal{D}_{\mathrm{model}}$ with probabilities $f$ and $1 - f$, respectively. We follow Yu et al. [18] to use $f = 0.5$.
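The short-horizon rollout and the mixed sampling from $\mathcal{D} = f\,\mathcal{D}_{\mathrm{env}} + (1 - f)\,\mathcal{D}_{\mathrm{model}}$ can be sketched in Python as follows; the buffer and model interfaces (`sample_states`, `step`, `add`, `sample`) are hypothetical placeholders used only for illustration.

```python
import numpy as np

def rollout_model(model, policy, d_env, d_model, num_starts, horizon):
    """Simulate pi_phi on the learned MDP M_hat for a short horizon h,
    starting from states sampled from D_env, and store the resulting
    transitions in the model replay buffer D_model."""
    states = d_env.sample_states(num_starts)
    for _ in range(horizon):
        actions = policy(states)
        # `model.step` is a hypothetical interface returning next states and
        # rewards from the learned dynamic P_hat and reward model r_hat.
        next_states, rewards = model.step(states, actions)
        d_model.add(states, actions, rewards, next_states)
        states = next_states

def sample_mixed_batch(d_env, d_model, batch_size, f=0.5):
    """Sample a training batch from D = f * D_env + (1 - f) * D_model,
    assuming each buffer's `sample(n)` returns a list of transitions."""
    n_env = np.random.binomial(batch_size, f)
    return d_env.sample(n_env) + d_model.sample(batch_size - n_env)
```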