
more fine-grained characterization of the effects of regularization, which leads to novel insights about
how to design better regularizers, and existing DICE estimators are subsumed as special cases of our
method when we choose very simple regularizers (see Remark 3 in Section 5).
Fitted-Q Evaluation (FQE)
Outside the MIS literature, one can obtain return and value-function estimation guarantees via FQE [DJW20; CJ19; LVY19; UIJKSX21]. However, it is well understood that FQE and related approaches require Bellman-completeness-type assumptions, such as the function class being closed under the Bellman operator. Even putting aside the difference between completeness and realizability, we allow for a user-specified error-measuring distribution, which is not available in FQE or any other existing method. The only distribution these methods are aware of is the data distribution $d^D$, and even so, FQE and its variants rarely provide guarantees on $\|\hat{q} - q^\pi\|_{2, d^D}$, but often on the Bellman error (e.g., $\|\hat{q} - \mathcal{T}^\pi \hat{q}\|_{2, d^D}$) instead [UIJKSX21]; obtaining guarantees on a distribution of interest often requires multiple indirect translations and loose relaxations.
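For readers unfamiliar with FQE, the following is a minimal sketch of its iterative regression with a linear function class; the interfaces `phi` (feature map), `pi` (target policy), and the ridge parameter `reg` are illustrative choices and not part of the cited works.

```python
import numpy as np

def fqe_linear(data, phi, pi, gamma, num_iters=200, reg=1e-6):
    """Minimal linear FQE sketch: repeatedly regress onto r + gamma * q_k(s', pi).

    data: list of (s, a, r, s_next) tuples sampled from d^D
    phi:  feature map, phi(s, a) -> np.ndarray of shape (d,)
    pi:   target policy, pi(s) -> iterable of (action, probability) pairs
    """
    d = phi(*data[0][:2]).shape[0]
    Phi = np.stack([phi(s, a) for (s, a, _, _) in data])        # (n, d) features
    rew = np.array([r for (_, _, r, _) in data])                # (n,) rewards
    theta = np.zeros(d)
    for _ in range(num_iters):
        # Regression targets: r + gamma * E_{a' ~ pi(.|s')}[q_k(s', a')]
        y = rew + gamma * np.array([
            sum(p * (phi(s_next, a_next) @ theta) for a_next, p in pi(s_next))
            for (_, _, _, s_next) in data
        ])
        # Least-squares fit under the data distribution d^D (ridge for stability)
        theta = np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ y)
    return lambda s, a: phi(s, a) @ theta                        # estimate of q^pi
```

Each iteration fits a regression target built from the current iterate under $d^D$, which is why the guarantees naturally obtained for such procedures are stated in terms of errors measured under the data distribution.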
LSTDQ
Our analyses focus on general function approximation. When restricted to linear classes,
function estimation guarantees for qπunder dDcan be obtained by LSTDQ methods [LP03; BY09;
DNP14] when the function class only satisfies realizability of
qπ
[PKBK22]. However, this requires
an additional matrix invertibility condition (see Assumption 3 of [PKBK22]), and it is still unclear
what this condition corresponds to in general function approximation.
4
Moreover, many general
methods—including MIS [UHJ20] and other minimax methods [ASM08; XCJMA21]—coincide
with LSTDQ in the linear case, so the aforementioned results can be viewed as a specialized analysis
leveraging the properties of linear classes.
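To make the linear case concrete, here is a minimal LSTDQ sketch; the interfaces `phi` and `pi` are illustrative, and the invertibility of the matrix `A` below is the kind of matrix condition referred to in the discussion above.

```python
import numpy as np

def lstdq(data, phi, pi, gamma):
    """Minimal LSTDQ sketch with linear features phi and target policy pi."""
    d = phi(*data[0][:2]).shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for (s, a, r, s_next) in data:
        f = phi(s, a)
        # Expected next-step feature under the target policy pi
        f_next = sum(p * phi(s_next, a_next) for a_next, p in pi(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    theta = np.linalg.solve(A, b)           # requires A to be invertible
    return lambda s, a: phi(s, a) @ theta   # linear estimate of q^pi
```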
PRO-RL [ZHHJL22]
Our key proof techniques are adapted from [ZHHJL22], whose goal is offline policy learning. They learn the importance weight function $w^\pi$ for a near-optimal $\pi$, and provide $\|\hat{w} - w^\pi\|_{2, d^D}$ guarantees as an intermediate result. Despite using similar technical tools, our most interesting and surprising results are in the value-function estimation setting, which is not considered by [ZHHJL22]. Our novel algorithmic insights, such as incorporating error-measuring distributions and approximate models in the regularizers, are also potentially useful in [ZHHJL22]'s policy-learning setting. Our analyses also reveal a number of important differences between OPE and offline policy learning, which will be discussed in Appendix A.
3 Preliminaries
We consider off-policy evaluation (OPE) in Markov Decision Processes (MDPs). An MDP is specified by its state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ ($\Delta(\cdot)$ is the probability simplex), reward function $R: \mathcal{S} \times \mathcal{A} \to \Delta([0,1])$, discount factor $\gamma \in [0, 1)$, and an initial state distribution $\mu_0 \in \Delta(\mathcal{S})$. We assume $\mathcal{S}$ and $\mathcal{A}$ are finite and discrete, but their cardinalities can be arbitrarily large. Given a target policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, a random trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ can be generated as $s_0 \sim \mu_0$, $a_t \sim \pi(\cdot|s_t)$, $r_t \sim R(\cdot|s_t, a_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$, $\forall t \geq 0$; we use $\mathbb{E}_\pi$ and $\mathbb{P}_\pi$ to refer to expectation and probability under such a distribution. The expected discounted return (or simply return) of $\pi$ is $J(\pi) := \mathbb{E}_\pi[\sum_t \gamma^t r_t]$. The Q-value function of $\pi$ is the unique solution of the Bellman equations $q^\pi = \mathcal{T}^\pi q^\pi$, with the Bellman operator $\mathcal{T}^\pi: \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ defined as $\forall q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, $(\mathcal{T}^\pi q)(s, a) := \mathbb{E}_{r \sim R(\cdot|s,a)}[r] + \gamma (P^\pi q)(s, a)$. Here $P^\pi \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A}| \times |\mathcal{S} \times \mathcal{A}|}$ is the state-action transition operator of $\pi$, defined as $(P^\pi q)(s, a) := \mathbb{E}_{s' \sim P(\cdot|s,a), a' \sim \pi(\cdot|s')}[q(s', a')]$. Functions over $\mathcal{S} \times \mathcal{A}$ (such as $q$) are also treated as $|\mathcal{S} \times \mathcal{A}|$-dimensional vectors interchangeably.
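As a purely illustrative rendering of these definitions, the tabular sketch below computes $q^\pi$ and $J(\pi)$ by solving the Bellman equations directly; the array layout (state-action pairs flattened as $s \cdot |\mathcal{A}| + a$) is an implementation choice introduced here.

```python
import numpy as np

def q_pi_and_return(P, R, pi, mu0, gamma):
    """Tabular rendering of the definitions above.

    P:   (S*A, S) matrix; row s*A + a is P(.|s, a)
    R:   (S*A,) vector of expected rewards E_{r ~ R(.|s,a)}[r]
    pi:  (S, A) matrix with pi[s, a] = pi(a|s)
    mu0: (S,) initial state distribution
    """
    S, A = pi.shape
    # State-action transition operator: P^pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
    P_pi = (P[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
    # q^pi is the unique solution of the Bellman equations q = R + gamma * P^pi q
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, R)
    # J(pi) = E_{s0 ~ mu0, a0 ~ pi(.|s0)}[q^pi(s0, a0)]
    nu0 = (mu0[:, None] * pi).reshape(S * A)
    return q.reshape(S, A), nu0 @ q
```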
In OPE, we want to estimate $q^\pi$ and other functions of interest based on a historical dataset collected by a possibly different policy. As a standard simplification, we assume that the offline dataset consists of $n$ i.i.d. tuples $\{(s_i, a_i, r_i, s_i')\}_{i=1}^n$ sampled as $(s_i, a_i) \sim d^D$, $r_i \sim R(\cdot|s_i, a_i)$, and $s_i' \sim P(\cdot|s_i, a_i)$. We call $d^D$ the (offline) data distribution. As another function of interest, the (marginalized importance) weight function $w^\pi$ is defined as $w^\pi(s, a) := d^\pi(s, a)/d^D(s, a)$, where $d^\pi(s, a) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \mathbb{P}_\pi[s_t = s, a_t = a]$ is the discounted state-action occupancy of $\pi$. For technical convenience we assume $d^D(s, a) > 0$ for all $(s, a)$, so that quantities like $w^\pi$ are always well defined and finite.⁵
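As an illustrative tabular sketch (array conventions as in the previous snippet), $d^\pi$ and $w^\pi$ can be computed in closed form from their definitions: the linear solve below is the definition $d^\pi = (1-\gamma)\sum_t \gamma^t \nu_0^\top (P^\pi)^t$ written as $(1-\gamma)\nu_0^\top (I - \gamma P^\pi)^{-1}$, where $\nu_0(s, a) := \mu_0(s)\pi(a|s)$ is notation introduced here for the initial state-action distribution.

```python
import numpy as np

def weight_function(P, pi, mu0, d_D, gamma):
    """Tabular rendering of d^pi and w^pi (array conventions as above).

    d_D: (S, A) offline data distribution, assumed strictly positive
    """
    S, A = pi.shape
    P_pi = (P[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
    nu0 = (mu0[:, None] * pi).reshape(S * A)   # distribution of (s0, a0)
    # d^pi (as a row vector) equals (1 - gamma) * nu0^T (I - gamma P^pi)^{-1}
    d_pi = (1 - gamma) * np.linalg.solve((np.eye(S * A) - gamma * P_pi).T, nu0)
    # w^pi(s, a) = d^pi(s, a) / d^D(s, a)
    return (d_pi / d_D.reshape(S * A)).reshape(S, A)
```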
Similarly to $q^\pi$, $w^\pi$ also satisfies a recursive equation, inherited from the Bellman
⁴ It is hinted by [UHJ20] that the invertibility is related to a loss-minimization condition in MIS, but the connection only holds for return estimation.
⁵ This assumption can be removed at the cost of more cumbersome derivations. Also, these density ratios can still take prohibitively large values even if they are finite, and we will need to make additional boundedness assumptions to enable finite-sample guarantees anyway, so their finiteness does not trivialize the analyses.