2 Related work
Offline Policy Learning (OPL). In OPL, the goal is to use historical data from a fixed behavior policy $\pi_b$ to learn a reward-maximizing policy in an unknown environment (a Markov Decision Process, defined in Section 3). Most work studying the sample complexity and efficiency of offline RL (Xie and Jiang, 2021; Yin et al., 2021) does not depend on the structure of a particular problem, yet empirical performance can vary considerably on pathological models that are not necessarily Markovian. Shi et al. (2020) developed a model selection procedure for testing the Markov hypothesis, which helps explain why performance differs across models and MDPs. A fully adaptive characterization of RL problems is therefore important, as it could save considerable time in designing domain-specific RL solutions (Zanette and Brunskill, 2019). OPL offers a rich array of methods, ranging from policy-gradient (Liu et al., 2019) and model-based (Yu et al., 2020; Kidambi et al., 2020) to model-free approaches (Siegel et al., 2020; Fujimoto et al., 2019; Guo et al., 2020; Kumar et al., 2020), each based on different assumptions about the system dynamics. Practitioners thus have at their disposal an array of algorithms and corresponding hyperparameters, but no clear consensus on a generally applicable evaluation tool for offline policy selection.
Offline Policy Evaluation (OPE). OPE is concerned with evaluating a target policy's performance using only pre-collected historical data generated by other (behavior) policies (Voloshin et al., 2021). Each OPE estimator has its own properties, and in this work we primarily consider two main variants (Voloshin et al., 2021): Weighted Importance Sampling (WIS) (Precup, 2000) and Fitted Q-Evaluation (FQE) (Le et al., 2019). Both WIS and FQE are sensitive to how the evaluation dataset is partitioned. WIS is undefined on trajectories where the target policy does not overlap with the behavior policy, and it self-normalizes with respect to the other trajectories in the dataset. FQE learns a Q-function from the evaluation dataset. This makes these estimators very different from mean-squared error or accuracy in the supervised learning setting: the choice of partitioning first affects the function approximation inside the estimator and then cascades down to the scores it produces.
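To make the self-normalization concrete, the sketch below computes a per-trajectory WIS estimate. The data layout, the `pi_e`/`pi_b` probability interfaces, and the function name are illustrative assumptions for this sketch, not the implementation used in this work.

```python
import numpy as np

def wis_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Weighted (self-normalized) importance sampling sketch (assumed data layout).

    trajectories: list of trajectories, each a list of (s, a, r) tuples.
    pi_e(a, s), pi_b(a, s): action probabilities under the target and behavior policies.
    """
    weights, returns = [], []
    for traj in trajectories:
        ratio, G = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            ratio *= pi_e(a, s) / pi_b(a, s)   # cumulative per-step importance ratio
            G += (gamma ** t) * r              # discounted return of the trajectory
        weights.append(ratio)
        returns.append(G)
    weights = np.asarray(weights)
    # Self-normalization: trajectories with zero overlap receive zero weight, and the
    # estimate is undefined when no trajectory in the dataset overlaps with pi_e.
    if weights.sum() == 0:
        raise ValueError("target policy has no overlap with the behavior data")
    return float(np.dot(weights, returns) / weights.sum())
```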
Offline Policy Selection (OPS). Typically, OPS is approached via OPE, which estimates the expected return of candidate policies. Zhang and Jiang (2021) address how to improve policy selection in the offline RL setting. Their algorithm builds on the Batch Value-Function Tournament (BVFT) (Xie and Jiang, 2021) approach: it estimates the best value function among a set of candidates using piecewise-linear value function approximations and selects the policy with the smallest projected Bellman error in that space. Previous work on estimator selection for the design of OPE methods includes Su et al. (2020) and Miyaguchi (2022), while Kumar et al. (2021), Lee et al. (2021), Tang and Wiens (2021), and Paine et al. (2020) focus on offline hyperparameter tuning. Kumar et al. (2021) give recommendations on when to stop training a model to avoid overfitting; their approach is designed exclusively for Q-learning methods with direct access to the internal Q-functions. In contrast, our pipeline performs policy training, selection, and deployment with any offline RL method, does not rely on the Markov assumption, and can select the best policy with potentially no access to the internal approximation functions (black box). We give a brief overview of some OPS approaches in Table 1.
3 Background and Problem Setting
We define a stochastic Decision Process $M = \langle S, A, T, r, \gamma \rangle$, where $S$ is a set of states; $A$ is a set of actions; $T$ is the transition dynamics (which might depend on the full history); $r$ is the reward function; and $\gamma \in (0,1)$ is the discount factor. Let $\tau = \{s_i, a_i, s'_i, r_i\}_{i=0}^{L}$ be the trajectory sampled from $\pi$ on $M$. The optimal policy $\pi$ is the one that maximizes the expected discounted return $V(\pi) = \mathbb{E}_{\tau \sim \rho_\pi}[G(\tau)]$, where $G(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$ and $\rho_\pi$ is the distribution of $\tau$ under policy $\pi$. For simplicity, in this paper we assume policies are Markov, $\pi: S \to A$, but it is straightforward to consider policies that are a function of the full history.
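As a small illustration of these definitions, the snippet below computes $G(\tau)$ for a single trajectory and a Monte Carlo estimate of $V(\pi)$ from a set of sampled trajectories; the function names and data layout are hypothetical.

```python
def discounted_return(rewards, gamma):
    # G(tau) = sum_{t >= 0} gamma^t * r_t for one trajectory's reward sequence
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def mc_value(reward_sequences, gamma):
    # Monte Carlo estimate of V(pi): the mean of G(tau) over trajectories drawn from rho_pi
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)
```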
In an offline RL problem, we take a dataset $D = \{\tau_i\}_{i=1}^{n}$, which can be collected by one or a group of policies, referred to as the behavior policy $\pi_b$, on the decision process $M$. The goal in offline/batch RL is to learn a decision policy $\pi$ from a class of policies with the best expected performance $V^{\pi}$ for future use. Let $A_i$ denote an AH pair, i.e., an offline policy learning algorithm together with its hyperparameters and model architecture. An offline policy estimator takes in a policy $\pi_e$ and a dataset $D$ and returns an estimate of its performance: $\hat{V}: \Pi \times \mathcal{D} \to \mathbb{R}$. In this work, we focus on two popular Offline Policy Evaluation (OPE) estimators: the Importance Sampling (IS) (Precup, 2000) and Fitted Q-Evaluation (FQE) (Le et al., 2019) estimators. We refer the reader to Voloshin et al. (2021) for a more comprehensive discussion.
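For concreteness, the sketch below outlines an FQE-style estimator of the form $\hat{V}: \Pi \times \mathcal{D} \to \mathbb{R}$: it repeatedly regresses Bellman targets under the evaluation policy on the evaluation dataset and then averages the fitted Q-values at the start states. The data layout, the choice of regressor, and the function names are assumptions made for illustration, not the estimator used in our experiments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fqe_estimate(transitions, start_states, pi_e, gamma=0.99, n_iters=50):
    """Minimal FQE sketch (assumed interface): transitions are (s, a, r, s') tuples with
    vector-valued states and discrete actions; pi_e(s) returns the action taken by the
    evaluation policy in state s. Terminal-state handling is omitted for brevity."""
    S = np.array([s for s, _, _, _ in transitions])
    A = np.array([a for _, a, _, _ in transitions])
    R = np.array([r for _, _, r, _ in transitions])
    S_next = np.array([sn for _, _, _, sn in transitions])

    X = np.column_stack([S, A])        # regress Q on (state, action) features
    targets = R.copy()                 # first iteration fits Q_hat ~ r
    q_model = None
    for _ in range(n_iters):
        q_model = GradientBoostingRegressor().fit(X, targets)
        A_next = np.array([pi_e(s) for s in S_next])
        q_next = q_model.predict(np.column_stack([S_next, A_next]))
        targets = R + gamma * q_next   # Bellman backup under the evaluation policy

    # Estimated value: average fitted Q at the start states under pi_e
    S0 = np.asarray(start_states)
    A0 = np.array([pi_e(s) for s in S0])
    return float(q_model.predict(np.column_stack([S0, A0])).mean())
```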