Data-Efficient Pipeline for Offline Reinforcement
Learning with Limited Data
Allen Nie, Yannis Flet-Berliac, Deon R. Jordan, William Steenbergen, Emma Brunskill
Department of Computer Science
Stanford University
*anie@stanford.edu
Abstract
Offline reinforcement learning (RL) can be used to improve future performance by
leveraging historical data. There exist many different algorithms for offline RL, and
it is well recognized that these algorithms, and their hyperparameter settings, can
lead to decision policies with substantially differing performance. This prompts
the need for pipelines that allow practitioners to systematically perform algorithm-
hyperparameter selection for their setting. Critically, in most real-world settings,
this pipeline must only involve the use of historical data. Inspired by statistical
model selection methods for supervised learning, we introduce a task- and method-
agnostic pipeline for automatically training, comparing, selecting, and deploying
the best policy when the provided dataset is limited in size. In particular, our
work highlights the importance of performing multiple data splits to produce more
reliable algorithm-hyperparameter selection. While this is a common approach
in supervised learning, to our knowledge, this has not been discussed in detail
in the offline RL setting. We show it can have substantial impacts when the
dataset is small. Compared to alternate approaches, our proposed pipeline outputs
higher-performing deployed policies from a broad range of offline policy learning
algorithms and across various simulation domains in healthcare, education, and
robotics. This work contributes toward the development of a general-purpose
meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.
1 Introduction
Offline/batch reinforcement learning has the potential to learn better decision policies from existing
real-world datasets on sequences of decisions made and their outcomes. In many of these settings,
tuning methods online is infeasible and deploying a new policy involves time, effort and potential
negative impact. Many of the existing datasets for applications that may benefit from offline RL are fairly small in comparison to those used in supervised machine learning. For instance, the MIMIC intensive care unit dataset on sepsis that is often studied in offline RL has 14k patients (Komorowski et al., 2018), the number of students frequently interacting with an online course often ranges from hundreds to tens of thousands (Bassen et al., 2020), and the number of demonstrations collected from a human operator manipulating a robotic arm is often on the order of a few hundred per task (Mandlekar et al., 2018). Recent studies (Mandlekar et al., 2021; Levine et al., 2020) highlight that in these small data regimes, selecting hyperparameters using only the training set is often challenging. Yet hyperparameter selection also has a substantial influence on the resulting policy's performance, particularly when the algorithm leverages deep neural networks.
One popular approach to address this is to learn policies from particular algorithm-hyperparameter pairs on a training set and then use offline policy selection, which selects the best policy given a validation set (Thomas et al., 2015a, 2019; Paine et al., 2020; Kumar et al., 2021).
| Common Practices | Non-Markov Env | Data Efficient (re-train) | Compare Across OPL | Considers Evaluation Variation | Considers Training Variation |
|---|---|---|---|---|---|
| Policy selection (1 split): | | | | | |
| Internal Objective / TD-Error (Thomas et al., 2015b, 2019) | (depends) | ✗ | ✗ | ✗ | ✗ |
| OPE methods (Komorowski et al., 2018; Paine et al., 2020) | (depends) | ✗ | ✓ | ✗ | ✗ |
| OPE + BCa Val. (Thomas et al., 2015b) | (depends) | ✗ | ✓ | ✓ | ✗ |
| BVFT (Xie and Jiang, 2021) | ✗ | ✗ | ✗ | ✗ | ✗ |
| BVFT + OPE (Zhang and Jiang, 2021) | ✗ | ✗ | ✓ | ✓ | ✗ |
| Q-Function Workflow (Kumar et al., 2021) | ✗ | ✓ | ✗ | ✗ | ✗ |
| Ours: $A_i$ selection (multi-split): | | | | | |
| Cross-Validation | ✓ | ✓ | ✓ | ✓ | ✓ |
| Repeated Random Subsampling | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 1: A summary of commonly used approaches for choosing a deployment policy from a fixed offline RL dataset. We define Data Efficient as: the approach assumes the algorithm can be re-trained on all data points; (depends) as: depends on whether the underlying OPL or OPE methods make explicit Markov assumption or not.
However, when the dataset is limited in size, this approach has important limitations: (a) if the validation set happens to contain no or very few good/high-reward trajectories, then trained policies cannot be properly evaluated; (b) if the training set has no or very few such trajectories, then no good policy behavior can be learned by any policy learning algorithm; and (c) using one fixed training dataset is prone to overfitting the hyperparameters to that particular dataset, and different hyperparameters could be picked if the training set changed. One natural solution to this problem is to train on the entire dataset and compare policy performance on that same dataset, which is often referred to as the internal objective approach. In Appendix A.1 we conduct a short experiment using D4RL where this approach fails due to the common issue of Q-value over-estimation (Fujimoto et al., 2019).
There has been much recent interest in providing more robust methods for offline RL. Many rely on the workflow just discussed, where methods are trained on one dataset and Offline Policy Evaluation (OPE) is used to do policy selection (Su et al., 2020; Paine et al., 2020; Zhang and Jiang, 2021; Kumar et al., 2021; Lee et al., 2021; Tang and Wiens, 2021; Miyaguchi, 2022). Our work highlights the impact of a less studied issue: the challenge caused by data partitioning variance. We first motivate the need to account for randomness in the train/validation partition by showing the wide distribution of OPE scores the same policy can obtain on different subsets of the data, and the very differently performing policies the same algorithm and hyperparameters can learn depending on the training set partition. We also prove that a single partition can have a notable failure rate in identifying the best algorithm-hyperparameter pair for learning the best policy.
We then introduce a general pipeline for algorithm-hyperparameter (AH) selection and policy deployment that: (a) uses repeated random sub-sampling (RRS) with replacement of the dataset to perform AH training, (b) uses OPE on the validation set, (c) computes aggregate statistics over the RRS splits to inform AH selection, and (d) uses the selected AH to retrain on the entire dataset to obtain the deployment policy. Though such repeated splitting is common in supervised learning, its impact and effect have been little studied in the offline RL framework. Perhaps surprisingly, we show that our simple pipeline leads to substantial performance improvements in a wide range of popular benchmark tasks, including D4RL (Fu et al., 2020) and Robomimic (Mandlekar et al., 2021).
2 Related work
Offline Policy Learning (OPL).
In OPL, the goal is to use historical data from a fixed behavior policy $\pi_b$ to learn a reward-maximizing policy in an unknown environment (a Markov Decision Process, defined in Section 3). Most work studying the sample complexity and efficiency of offline RL (Xie and Jiang, 2021; Yin et al., 2021) does not depend on the structure of a particular problem, but empirical performance may vary with some pathological models that are not necessarily Markovian. Shi et al. (2020) developed precisely such a model selection procedure for testing the Markov hypothesis, which helps explain differing performance across models and MDPs. To address this problem, it is inherently important to have a fully adaptive characterization in RL because it could save considerable time in designing domain-specific RL solutions (Zanette and Brunskill, 2019). As an answer to a variety of problems, OPL offers many different methods, ranging from policy gradient (Liu et al., 2019) and model-based methods (Yu et al., 2020; Kidambi et al., 2020) to model-free methods (Siegel et al., 2020; Fujimoto et al., 2019; Guo et al., 2020; Kumar et al., 2020), each based on different assumptions about the system dynamics. Practitioners thus have at their disposal an array of algorithms and corresponding hyperparameters, with no clear consensus on a generally applicable evaluation tool for offline policy selection.
Offline Policy Evaluation (OPE).
OPE is concerned with evaluating a target policy's performance using only pre-collected historical data generated by other (behavior) policies (Voloshin et al., 2021). Each of the many OPE estimators has its own unique properties, and in this work we primarily consider two main variants (Voloshin et al., 2021): Weighted Importance Sampling (WIS) (Precup, 2000) and Fitted Q-Evaluation (FQE) (Le et al., 2019). Both WIS and FQE are sensitive to the partitioning of the evaluation dataset. WIS is undefined on trajectories where the target policy does not overlap with the behavior policy, and it self-normalizes with respect to the other trajectories in the dataset. FQE learns a Q-function using the evaluation dataset. This makes these estimators very different from mean-squared error or accuracy in the supervised learning setting: the choice of partitioning first affects the function approximation in the estimator and then cascades down to the scores they produce.
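To make this partition sensitivity concrete, here is a minimal per-trajectory WIS sketch (our own illustration, not the authors' implementation). The `target_prob` and `behavior_prob` callables are assumed helpers returning action probabilities under the evaluation and behavior policies; note that both the importance weights and the self-normalizing denominator are computed from whatever trajectories happen to land in the validation split.

```python
import numpy as np

def wis_estimate(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Per-trajectory weighted importance sampling (minimal sketch).

    trajectories: list of trajectories, each a list of (s, a, s_next, r) tuples.
    target_prob(s, a), behavior_prob(s, a): action probabilities under the
    evaluation policy and the behavior policy (assumed callables).
    """
    weights, returns = [], []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (s, a, _s_next, r) in enumerate(traj):
            w *= target_prob(s, a) / behavior_prob(s, a)  # cumulative ratio
            g += gamma ** t * r                           # discounted return
        weights.append(w)
        returns.append(g)
    weights = np.asarray(weights, dtype=float)
    if weights.sum() == 0.0:
        return float("nan")  # no overlap with the target policy: WIS undefined
    return float(np.dot(weights, returns) / weights.sum())
```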
Offline Policy Selection (OPS).
Typically, OPS is approached via OPE, which estimates the expected
return of candidate policies. Zhang and Jiang (2021) address how to improve policy selection in
the offline RL setting. The algorithm builds on the Batch Value-Function Tournament (BVFT) (Xie
and Jiang,2021) approach to estimating the best value function among a set of candidates using
piece-wise linear value function approximations and selecting the policy with the smallest projected
Bellman error in that space. Previous work on estimator selection for the design of OPE methods
include Su et al. (2020); Miyaguchi (2022) while Kumar et al. (2021); Lee et al. (2021); Tang and
Wiens (2021); Paine et al. (2020) focus on offline hyperparameter tuning. Kumar et al. (2021) give
recommendations on when to stop training a model to avoid overfitting. The approach is exclusively
designed for Q-learning methods with direct access to the internal Q-functions. On the contrary,
our pipeline does policy training, selection, and deployment on any offline RL method, not reliant
on the Markov assumption, and can select the best policy with potentially no access to the internal
approximation functions (black box). We give a brief overview of some OPS approaches in Table 1.
3 Background and Problem Setting
We define a stochastic Decision Process $M = \langle S, A, T, r, \gamma \rangle$, where $S$ is a set of states; $A$ is a set of actions; $T$ is the transition dynamics (which might depend on the full history); $r$ is the reward function; and $\gamma \in (0, 1)$ is the discount factor. Let $\tau = \{s_i, a_i, s'_i, r_i\}_{i=0}^{L}$ be a trajectory sampled from $\pi$ on $M$. The optimal policy $\pi^*$ is the one that maximizes the expected discounted return $V(\pi) = \mathbb{E}_{\tau \sim \rho_\pi}[G(\tau)]$, where $G(\tau) = \sum_{t=0}^{L} \gamma^t r_t$ and $\rho_\pi$ is the distribution of $\tau$ under policy $\pi$. For simplicity, in this paper we assume policies are Markov, $\pi : S \to A$, but it is straightforward to consider policies that are a function of the full history. In an offline RL problem, we take a dataset $D = \{\tau_i\}_{i=1}^{n}$, which can be collected by one or a group of policies, which we refer to as the behavior policy $\pi_b$, on the decision process $M$. The goal in offline/batch RL is to learn a decision policy $\pi$ from a class of policies with the best expected performance $V^\pi$ for future use. Let $A_i$ denote an AH pair, i.e. an offline policy learning algorithm together with its hyperparameters and model architecture. An offline policy estimator takes in a policy $\pi_e$ and a dataset $D$ and returns an estimate of its performance: $\hat{V} : \Pi \times \mathcal{D} \to \mathbb{R}$. In this work, we focus on two popular Offline Policy Evaluation (OPE) estimators: Importance Sampling (IS) (Precup, 2000) and Fitted Q-Evaluation (FQE) (Le et al., 2019). We refer the reader to Voloshin et al. (2021) for a more comprehensive discussion.
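For the pipeline introduced later, it helps to view each AH pair and the estimator $\hat{V}$ as black boxes with two narrow interfaces. The sketch below is our own framing with hypothetical names (the paper prescribes no API); these are the only two operations the rest of the pipeline relies on.

```python
from typing import Any, Callable, Protocol, Sequence, Tuple

# A trajectory is a sequence of (s, a, s', r) tuples, as defined above.
Trajectory = Sequence[Tuple[Any, Any, Any, float]]
# A Markov policy maps a state to an action.
Policy = Callable[[Any], Any]

class OfflineAlgorithm(Protocol):
    """An A_i pair: a learning algorithm bound to fixed hyperparameters."""
    def train(self, dataset: Sequence[Trajectory]) -> Policy: ...

class OPEEstimator(Protocol):
    """V-hat: Pi x D -> R, an off-policy estimate of a policy's value."""
    def estimate(self, policy: Policy, dataset: Sequence[Trajectory]) -> float: ...
```

Any offline RL method that fits behind `train`, and any OPE method that fits behind `estimate`, can then be treated as a black box by the selection procedure.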
[Figure 1: two panels, (a) "OPE Estimates on 10 Partitions" and (b) "True Reward of Policy trained on 10 Partitions".]
Figure 1: True performance and evaluation of 6 $A_i$ pairs on the Sepsis-POMDP (N=1000) domain. (a) shows the OPE estimations and (b) shows the variation in terms of true performance. The variations are due to the different AH pairs of the policies but also to the sensitivity to the training/validation splits.
4 The Challenge of Offline RL $A_i$ Selection
An interesting use-case of offline RL is when domain experts have access to an existing dataset
(with potentially only a few hundred trajectories) about sequences of decisions made and respective
outcomes, with the hope of leveraging the dataset to learn a better decision policy for future use.
In this setting, the user may want to consider many options regarding the type of RL algorithm (model-based, model-free, or direct policy search), the hyperparameters, or the deep network architecture to use.
Automated algorithm selection is important because different $A_i$ (different AH pairs) may learn very diverse policies, each with significantly different performance $V^{A_i}$. Naturally, one can expect that various algorithms lead to diverse performance, but using a case-study experiment on a sepsis simulator (Oberst and Sontag, 2019), we observe in Figure 1(b) that the sensitivity to hyperparameter selection is also substantial (cf. different average values in box plots for each method). For example, MBS-QI (Liu et al., 2020) learns policies ranging from roughly -12 to -3 in performance, depending on the hyperparameters chosen.
To address hyperparameter tuning, past work often relies on executing the learned policies in a simulator or the real environment. When this is not feasible, as in many real-world applications, including our sepsis dataset example where the user may only be able to leverage existing historical data, we have no choice but to rely on off-policy evaluation. Prior work (Thomas et al., 2015b; Farajtabar et al., 2018; Thomas et al., 2019; Mandlekar et al., 2021) has suggested doing so using a hold-out method, after partitioning the dataset into training and validation sets.
Unfortunately, the partitioning of the dataset itself may result in substantial variability in the training process (Dietterich, 1998). We note that this problem is particularly prominent in offline RL, where high-reward trajectories are sparse and affect both policy learning and policy evaluation. To explore this hypothesis, we consider the influence of the train/validation partition in the same sepsis domain, and we evaluate the trained policies using the Weighted Importance Sampling (WIS) (Precup, 2000) estimator. Figure 1(a) shows that the policies have drastically different OPE estimations, with sensitivity to the randomness in the dataset partitioning. We observe the same phenomenon in Figure 1(b), with largely different true performances depending on the dataset split for most of the policies $A_i$. This is also illustrated in the left sub-figure of Figure 4: when a single train-validation split is used, an $A_i$ that yields lower-performing policies will often be selected over those that yield higher-performing policies when deployed.
4.1 Repeated Experiments for Robust Hyperparameter Evaluation in Offline RL
We now demonstrate why it is important to conduct repeated random sub-sampling of the dataset in offline RL. Consider a finite set of $J$ offline RL algorithms $\mathbb{A}$. Let the policy produced by algorithm $A_j$ on training dataset $D$ be $\pi_j$, its estimated performance on a validation set be $\hat{V}^{\pi_j}$, and its true (unknown) value be $V^{\pi_j}$. Denote the true best resulting policy as $\pi_{j^*} = \arg\max_j V^{\pi_j}$ and the corresponding algorithm as $A_{j^*}$. Let the best policy picked based on its validation set performance be $\pi_{\hat{j}} = \arg\max_j \hat{V}^{\pi_j}$ and the corresponding algorithm $A_{\hat{j}}$.
Theorem 1. There exist stochastic decision processes and datasets such that (i) using a single train/validation split procedure that selects an algorithm-hyperparameter with the best performance on the validation dataset will select a suboptimal policy and algorithm with significant finite probability, $P(\pi_{\hat{j}} \neq \pi_{j^*}) \geq C$, with corresponding substantial loss in performance $O(V_{\max})$, and, in contrast, (ii) selecting the algorithm-hyperparameter with the best average validation performance across $N_s$ train/validation splits will select the optimal algorithm and policy with probability 1: $\lim_{N_s \to \infty} P(\pi_{\hat{j}} = \pi_{j^*}) = 1$.
Proof Sketch. Due to space constraints we defer the proof to Appendix A.3. Briefly, the proof proceeds by example: we construct a chain-like stochastic decision process and consider a class of algorithms that optimize over differing horizons (see e.g. Jiang et al. (2015); Cheng et al. (2021); Mazoure et al. (2021)). The behavior policy is uniformly random, meaning that trajectories with high rewards are sparse. There is therefore a notable probability that, in a single partition of the dataset, the resulting training and/or validation set contains no high-reward trajectory, making it impossible to identify that a full-horizon algorithm, and its resulting policy, is optimal.
In the proof and our experiments, we focus on the case where the training and validation sets are of equal size. If we use an uneven split, such as 80/20%, the failure probability can further increase when only a single partition of the dataset is used. We provide an illustrative example in the Appendix. Note that Leave-one-out Cross-Validation (LooCV) will also fail in our setting if we employ WIS, as we do in our algorithm: because WIS is a biased estimator, it returns the observed return of the behavior policy when averaging over a single trajectory, independent of the target policy to be evaluated. We explain this further in Appendix A.11.
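To see why LooCV is uninformative here, note that with a single held-out trajectory the WIS self-normalization cancels the importance weight entirely (notation from Section 3; $w_1$ denotes the cumulative importance ratio of the lone held-out trajectory):

```latex
% WIS with a single evaluation trajectory tau_1 (the LooCV case):
\hat{V}_{\mathrm{WIS}}(\pi_e)
  = \frac{\sum_{i=1}^{1} w_i \, G(\tau_i)}{\sum_{i=1}^{1} w_i}
  = \frac{w_1 \, G(\tau_1)}{w_1}
  = G(\tau_1),
% i.e. the observed behavior-policy return, independent of the target policy
% pi_e whenever w_1 > 0.
```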
5 SSR: Repeated Random Sampling for $A_i$ Selection and Deployment
In this paper, we are interested in the following problem: if offline RL training and evaluation are very sensitive to the partitioning of the dataset, especially in small data regimes, how can we reliably produce a final policy that we are confident is better than others and can be reliably deployed in the real world?
Instead of treating the sensitivity to the data partition as an inherent obstacle for offline policy selection, we view it as a source of statistics to leverage for $A_i$ selection. We propose a general pipeline, Split Select Retrain (SSR) (pseudo-code in Algorithm 1, Appendix A.4), to reliably optimize for a good deployed policy given only: an offline dataset, an input set of AH pairs, and an off-policy evaluation (OPE) estimator. This deployment approach leverages the random variation created by dataset partitioning to select algorithms that perform better on average, using a robust hyperparameter evaluation approach which we develop below.
First, we split and create different partitions of the input dataset. For each train/validation split, each algorithm-hyperparameter (AH) pair is trained on the training set and evaluated using the input OPE method to yield an estimated value on the validation set. These estimated values are then averaged over the splits, and the best AH pair ($A^*$) is selected as the one with the highest average score. The last step of the SSR pipeline is then to re-use the entire dataset to train one policy $\pi^*$ using $A^*$.
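The sketch below puts these steps together end to end. It is our paraphrase of the pipeline (the authors' reference pseudo-code is Algorithm 1 in Appendix A.4), reusing the hypothetical `train`/`estimate` interfaces from Section 3.

```python
import numpy as np

def ssr_select_and_deploy(dataset, candidates, ope_estimator,
                          n_splits=10, valid_frac=0.5, seed=0):
    """Split Select Retrain (SSR), sketched: dataset is a list of trajectories,
    candidates maps a name to an AH object with .train(trajs) -> policy, and
    ope_estimator exposes .estimate(policy, trajs) -> float."""
    rng = np.random.default_rng(seed)
    n = len(dataset)
    scores = {name: [] for name in candidates}

    # (a) + (b): repeated random sub-sampling splits; train each AH pair on the
    # training half and score the resulting policy with OPE on the held-out half.
    for _ in range(n_splits):
        perm = rng.permutation(n)
        n_valid = int(valid_frac * n)
        valid = [dataset[i] for i in perm[:n_valid]]
        train = [dataset[i] for i in perm[n_valid:]]
        for name, algo in candidates.items():
            policy = algo.train(train)
            scores[name].append(ope_estimator.estimate(policy, valid))

    # (c): aggregate over splits and select the AH pair with the best average.
    best = max(scores, key=lambda name: float(np.mean(scores[name])))

    # (d): retrain the selected AH pair on the entire dataset for deployment.
    return best, candidates[best].train(dataset)
```

Step (c) here is a plain mean over splits, which corresponds to the aggregate statistic defined in Equation (1) below.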
Repeated Random Sub-sampling (RRS).
As Theorem 1 suggests, one should ensure a sufficient number of trajectories in the evaluation partition to lower the failure rate $C$. We propose to create RRS train-validation partitions. This approach has many names in the statistical model selection literature, such as the Predictive Sample Reuse Method (Geisser, 1975), the Repeated Learning-Test Method (Burman, 1989), and Monte-Carlo Cross-Validation (Dubitzky et al., 2007). It has also been referred to as Repeated Data Splitting (Chernozhukov et al., 2018) in the heterogeneous treatment effect literature. We randomly select trajectories in $D$ and split them into two parts, $R^{\text{train}}$ and $R^{\text{valid}}$, and we repeat this splitting process $K$ times to generate paired datasets $(R^{\text{train}}_1, R^{\text{valid}}_1), (R^{\text{train}}_2, R^{\text{valid}}_2), \ldots, (R^{\text{train}}_K, R^{\text{valid}}_K)$. We compute the generalization performance estimate as follows:

$$G_{A, RS_K} = \frac{1}{K} \sum_{k=1}^{K} \hat{V}\big(A(R^{\text{train}}_k);\, R^{\text{valid}}_k\big) \qquad (1)$$
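Concretely, the $K$ overlapping splits and the average in Eq. (1) can be produced with an off-the-shelf Monte-Carlo cross-validation splitter. The sketch below uses scikit-learn's `ShuffleSplit` over trajectory indices and the same assumed `train`/`estimate` interfaces as before; it is an illustration, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def rrs_generalization_score(dataset, algo, ope_estimator,
                             k=10, valid_frac=0.5, seed=0):
    """Eq. (1): average held-out OPE score of one AH pair over K RRS splits.
    Splits are over whole trajectories, and they overlap across repeats."""
    indices = np.arange(len(dataset))
    splitter = ShuffleSplit(n_splits=k, test_size=valid_frac, random_state=seed)
    scores = []
    for train_idx, valid_idx in splitter.split(indices):
        policy = algo.train([dataset[i] for i in train_idx])
        scores.append(ope_estimator.estimate(policy, [dataset[i] for i in valid_idx]))
    return float(np.mean(scores))
```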
A key advantage of overlapping partitions is that they maintain the size of the validation dataset as $K$ increases. This might be favorable since OPE estimates are highly dependent on the state-action coverage of the validation dataset: the more data in the validation dataset, the better OPE estimators can evaluate a policy's performance. As $K \to \infty$, RRS approaches the leave-$p$-out cross-validation