2 Related work
Offline Policy Learning (OPL). In OPL, the goal is to use historical data from a fixed behavior policy $\pi_b$ to learn a reward-maximizing policy in an unknown environment (a Markov Decision Process, defined in Section 3). Most work studying the sample complexity and efficiency of offline RL (Xie and Jiang, 2021; Yin et al., 2021) does not depend on the structure of a particular problem, yet empirical performance can vary considerably on pathological models that are not necessarily Markovian. Shi et al. (2020) developed a model selection procedure for testing the Markov hypothesis, which helps explain why performance differs across models and MDPs. A fully adaptive characterization of RL problems is therefore important, as it could save considerable time in designing domain-specific RL solutions (Zanette and Brunskill, 2019). OPL offers a rich array of methods, ranging from policy-gradient (Liu et al., 2019) and model-based (Yu et al., 2020; Kidambi et al., 2020) to model-free approaches (Siegel et al., 2020; Fujimoto et al., 2019; Guo et al., 2020; Kumar et al., 2020), each based on different assumptions about the system dynamics. Practitioners thus have at their disposal an array of algorithms and corresponding hyperparameters, but no clear consensus on a generally applicable evaluation tool for offline policy selection.
Offline Policy Evaluation (OPE). OPE is concerned with evaluating a target policy's performance using only pre-collected historical data generated by other (behavior) policies (Voloshin et al., 2021). Each OPE estimator has its own properties, and in this work we primarily consider two main variants (Voloshin et al., 2021): Weighted Importance Sampling (WIS) (Precup, 2000) and Fitted Q-Evaluation (FQE) (Le et al., 2019). Both WIS and FQE are sensitive to how the evaluation dataset is partitioned. WIS is undefined on trajectories where the target policy does not overlap with the behavior policy, and it self-normalizes with respect to the other trajectories in the dataset. FQE learns a Q-function from the evaluation dataset. This makes these estimators very different from mean-squared error or accuracy in the supervised learning setting: the choice of partitioning first affects the function approximation inside the estimator and then cascades down to the scores it produces.
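To make the self-normalization concrete, the sketch below computes a per-trajectory WIS estimate. The data layout, the `pi_e`/`pi_b` probability interfaces, and the function name are illustrative assumptions for this sketch, not the implementation used in this work.

```python
import numpy as np

def wis_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Weighted (self-normalized) importance sampling sketch (assumed data layout).

    trajectories: list of trajectories, each a list of (s, a, r) tuples.
    pi_e(a, s), pi_b(a, s): action probabilities under the target and behavior policies.
    """
    weights, returns = [], []
    for traj in trajectories:
        ratio, G = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            ratio *= pi_e(a, s) / pi_b(a, s)   # cumulative per-step importance ratio
            G += (gamma ** t) * r              # discounted return of the trajectory
        weights.append(ratio)
        returns.append(G)
    weights = np.asarray(weights)
    # Self-normalization: trajectories with zero overlap receive zero weight, and the
    # estimate is undefined when no trajectory in the dataset overlaps with pi_e.
    if weights.sum() == 0:
        raise ValueError("target policy has no overlap with the behavior data")
    return float(np.dot(weights, returns) / weights.sum())
```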
Offline Policy Selection (OPS). Typically, OPS is approached via OPE, which estimates the expected return of candidate policies. Zhang and Jiang (2021) address how to improve policy selection in the offline RL setting. Their algorithm builds on the Batch Value-Function Tournament (BVFT) (Xie and Jiang, 2021) approach: it estimates the best value function among a set of candidates using piecewise-linear value function approximations and selects the policy with the smallest projected Bellman error in that space. Previous work on estimator selection for the design of OPE methods includes Su et al. (2020) and Miyaguchi (2022), while Kumar et al. (2021), Lee et al. (2021), Tang and Wiens (2021), and Paine et al. (2020) focus on offline hyperparameter tuning. Kumar et al. (2021) give recommendations on when to stop training a model to avoid overfitting; their approach is designed exclusively for Q-learning methods with direct access to the internal Q-functions. In contrast, our pipeline performs policy training, selection, and deployment with any offline RL method, does not rely on the Markov assumption, and can select the best policy with potentially no access to the internal approximation functions (black box). We give a brief overview of some OPS approaches in Table 1.
3 Background and Problem Setting
We define a stochastic Decision Process $M = \langle S, A, T, r, \gamma \rangle$, where $S$ is a set of states; $A$ is a set of actions; $T$ is the transition dynamics (which might depend on the full history); $r$ is the reward function; and $\gamma \in (0,1)$ is the discount factor. Let $\tau = \{s_i, a_i, s'_i, r_i\}_{i=0}^{L}$ be the trajectory sampled from $\pi$ on $M$. The optimal policy $\pi$ is the one that maximizes the expected discounted return $V(\pi) = \mathbb{E}_{\tau \sim \rho_\pi}[G(\tau)]$, where $G(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$ and $\rho_\pi$ is the distribution of $\tau$ under policy $\pi$. For simplicity, in this paper we assume policies are Markov, $\pi: S \to A$, but it is straightforward to consider policies that are a function of the full history.
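As a small illustration of these definitions, the snippet below computes $G(\tau)$ for a single trajectory and a Monte Carlo estimate of $V(\pi)$ from a set of sampled trajectories; the function names and data layout are hypothetical.

```python
def discounted_return(rewards, gamma):
    # G(tau) = sum_{t >= 0} gamma^t * r_t for one trajectory's reward sequence
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def mc_value(reward_sequences, gamma):
    # Monte Carlo estimate of V(pi): the mean of G(tau) over trajectories drawn from rho_pi
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)
```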
In an offline RL problem, we take a dataset $D = \{\tau_i\}_{i=1}^{n}$, which can be collected by one or a group of policies, referred to as the behavior policy $\pi_b$, on the decision process $M$. The goal in offline/batch RL is to learn a decision policy $\pi$ from a class of policies with the best expected performance $V^{\pi}$ for future use. Let $A_i$ denote an AH pair, i.e., an offline policy learning algorithm together with its hyperparameters and model architecture. An offline policy estimator takes in a policy $\pi_e$ and a dataset $D$ and returns an estimate of its performance: $\hat{V}: \Pi \times \mathcal{D} \to \mathbb{R}$. In this work, we focus on two popular Offline Policy Evaluation (OPE) estimators: the Importance Sampling (IS) (Precup, 2000) and Fitted Q-Evaluation (FQE) (Le et al., 2019) estimators. We refer the reader to Voloshin et al. (2021) for a more comprehensive discussion.
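For concreteness, the sketch below outlines an FQE-style estimator of the form $\hat{V}: \Pi \times \mathcal{D} \to \mathbb{R}$: it repeatedly regresses Bellman targets under the evaluation policy on the evaluation dataset and then averages the fitted Q-values at the start states. The data layout, the choice of regressor, and the function names are assumptions made for illustration, not the estimator used in our experiments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fqe_estimate(transitions, start_states, pi_e, gamma=0.99, n_iters=50):
    """Minimal FQE sketch (assumed interface): transitions are (s, a, r, s') tuples with
    vector-valued states and discrete actions; pi_e(s) returns the action taken by the
    evaluation policy in state s. Terminal-state handling is omitted for brevity."""
    S = np.array([s for s, _, _, _ in transitions])
    A = np.array([a for _, a, _, _ in transitions])
    R = np.array([r for _, _, r, _ in transitions])
    S_next = np.array([sn for _, _, _, sn in transitions])

    X = np.column_stack([S, A])        # regress Q on (state, action) features
    targets = R.copy()                 # first iteration fits Q_hat ~ r
    q_model = None
    for _ in range(n_iters):
        q_model = GradientBoostingRegressor().fit(X, targets)
        A_next = np.array([pi_e(s) for s in S_next])
        q_next = q_model.predict(np.column_stack([S_next, A_next]))
        targets = R + gamma * q_next   # Bellman backup under the evaluation policy

    # Estimated value: average fitted Q at the start states under pi_e
    S0 = np.asarray(start_states)
    A0 = np.array([pi_e(s) for s in S0])
    return float(q_model.predict(np.column_stack([S0, A0])).mean())
```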