Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning

Tung Nguyen 1, Qinqing Zheng 2, Aditya Grover 1
1 UCLA, 2 Meta AI Research. Correspondence to: Tung Nguyen <tungnd@cs.ucla.edu>.
Abstract
Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts, while enjoying much more simplicity and training stability. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (ood) returns. This is crucial in practice, as we often expect the policy to perform better than the offline dataset by conditioning on an ood value. We show that this unreliability arises from both the suboptimality of training data and model architectures. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights the high-return trajectories to reduce the train-test gap for BC methods, while the conservative regularizer encourages the policy to stay close to the data distribution for ood conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.
1 Introduction
In many real-world applications such as education, healthcare, and autonomous driving, collecting data via online interactions is expensive or even dangerous. However, we often have access to logged datasets in these domains that have been collected previously by some unknown policies.
The goal of offline reinforcement learning (RL) is to directly learn effective agent policies from such datasets, without additional online interactions (Lange et al., 2012; Levine et al., 2020). Many online RL algorithms have been adapted to work in the offline setting, including value-based methods (Fujimoto et al., 2019; Ghasemipour et al., 2021; Wu et al., 2019; Jaques et al., 2019; Kumar et al., 2020; Fujimoto & Gu, 2021; Kostrikov et al., 2021a) as well as model-based methods (Yu et al., 2020; Kidambi et al., 2020). The key challenge in all these methods is to generalize the value or dynamics to state-action pairs outside the offline dataset.
An alternative way to approach offline RL is via approaches derived from behavioral cloning (BC) (Bain & Sammut, 1995). BC is a supervised learning technique that was initially developed for imitation learning, where the goal is to learn a policy that mimics expert demonstrations. Recently, a number of works propose to formulate offline RL as a supervised learning problem (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). Since offline RL datasets usually do not have expert demonstrations, these works condition BC on extra context information to specify target outcomes such as returns and goals. Compared with the value-based approaches, the empirical evidence has shown that these conditional BC approaches perform competitively, and they additionally enjoy the enhanced simplicity and training stability of supervised learning.
As the maximum return in the offline trajectories is often far below the desired expert returns, we expect the policy to extrapolate over the offline data by conditioning on out-of-distribution (ood) expert returns. In an ideal world, the policy will achieve the desired outcomes, even when they are unseen during training. This corresponds to Figure 1a, where the relationship between the achieved and target returns forms a straight line. In reality, however, the performance of current methods is far from ideal. Specifically, the actual performance closely follows the target return and peaks at a point near the maximum return in the dataset, but drops sharply if conditioned on a return beyond that point. Figure 1b illustrates this problem.
We systematically analyze the unreliability of current methods, and show that it depends on both the quality of offline data and the architecture of the return-conditioned policy.
Figure 1 (panels: (a) Ideal, (b) Unreliable, (c) Reliable): Illustrative figures demonstrating three hypothetical scenarios for conditioning of BC methods for offline RL. The green line shows the maximum return in the offline dataset, while the orange line shows the expert return. The ideal scenario (a) is hard or even impossible to achieve with suboptimal offline data. On the other hand, return-conditioned RL methods can show unreliable generalization (b), where the performance drops quickly after a certain point in the vicinity of the dataset maximum. Our goal is to ensure reliable generalization (c) even when conditioned on ood returns.
For the former, we observe that offline datasets are generally suboptimal and, even in the range of observed returns, the distribution is highly non-uniform and concentrated over trajectories with low returns. This affects reliability, as we are mostly concerned with conditioning the policy on returns near or above the observed maximum in the offline dataset. One trivial solution to this problem is to simply filter out the low-return trajectories prior to learning. However, this is not always viable, as filtering can eliminate a good fraction of the offline trajectories, leading to poor data efficiency.
On the architecture aspect, we find that existing BC methods have significantly different behaviors when conditioning on ood returns. While DT (Chen et al., 2021) generalizes to ood returns reliably, RvS (Emmons et al., 2021) is highly sensitive to such ood conditioning and exhibits vast drops in peak performance for such ood inputs. Therefore, the current practice for setting the conditioning return at test time in RvS is based on careful tuning with online rollouts, which is often tedious, impractical, and inconsistent with the promise of offline RL to minimize online interactions.
While the idealized scenario in Figure 1a is hard to achieve or even impossible depending on the training dataset and environment (Wang et al., 2020; Zanette, 2021; Foster et al., 2021), the unreliability of these methods is a major barrier for high-stakes deployments. Hence, we focus this work on improving the reliability of return-conditioned offline RL methods. Figure 1c illustrates this goal, where conditioning beyond the dataset maximum return does not degrade the model performance, even if the achieved returns do not match the target conditioning. To this end, we propose ConserWeightive Behavioral Cloning (CWBC), which consists of two key components: trajectory weighting and conservative regularization. Trajectory weighting assigns and adjusts weights to each trajectory during training and prioritizes high-return trajectories for improved reliability. Next, we introduce a notion of conservatism for ood-sensitive BC methods such as RvS, which encourages the policy to stay close to the observed state-action distribution when conditioning on high returns. We achieve conservatism by selectively perturbing the returns of the high-return trajectories with a novel noise model and projecting the predicted actions to the ones observed in the unperturbed trajectory.
Our proposed framework is simple and easy to implement. Empirically, we instantiate CWBC in the context of RvS (Emmons et al., 2021) and DT (Chen et al., 2021), two state-of-the-art BC methods for offline RL. CWBC significantly improves the performance of RvS and DT in D4RL (Fu et al., 2020) locomotion tasks by 18% and 8%, respectively, without any hand-picking of the value of the conditioning returns at test time.
2 Preliminaries
We model our environment as a Markov decision process (MDP) (Bellman, 1957), which can be described by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_1)$ is the distribution of the initial state, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $R(s_t, a_t)$ is the deterministic reward function, and $\gamma$ is the discount factor. At each timestep $t$, the agent observes a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. This moves the agent to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and provides the agent with a reward $r_t = R(s_t, a_t)$.
Offline RL. We are interested in learning a (near-)optimal policy from a static offline dataset of trajectories collected by unknown policies, denoted as $\mathcal{T}_{\text{offline}}$. We assume that these trajectories are i.i.d. samples drawn from some unknown static distribution $\mathcal{T}$. We use $\tau$ to denote a trajectory and $|\tau|$ to denote its length. Following Chen et al. (2021), the return-to-go (RTG) for a trajectory $\tau$ at timestep $t$ is defined as the sum of rewards starting from $t$ until the end of the trajectory: $g_t = \sum_{t'=t}^{|\tau|} r_{t'}$. This means the initial RTG $g_1$ is equal to the total return of the trajectory, $r_\tau = \sum_{t=1}^{|\tau|} r_t$.
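To make the return-to-go concrete, the short sketch below (ours, not from the paper) computes $g_t$ for every timestep of one trajectory from its reward sequence; the function name and the NumPy-array input format are illustrative assumptions.

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """Compute g_t = sum_{t'=t}^{|tau|} r_{t'} for every timestep t.

    rewards: shape (T,), the per-step rewards of a single trajectory.
    Returns an array of the same shape; element 0 equals the total return r_tau.
    """
    # Reverse the rewards, take a cumulative sum, then reverse back.
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards [1, 0, 2] give RTGs [3, 2, 2]; g_1 = 3 is the trajectory return.
print(returns_to_go(np.array([1.0, 0.0, 2.0])))
```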
Figure 2: Reliability of RvS and DT on different walker2d datasets. The first row shows the performance of the two methods, and the second row shows the return distribution of each dataset. Reliability decreases as the data quality decreases from med-expert to med-replay. While DT performs reliably, RvS exhibits vast drops in performance.

Decision Transformer (DT). DT (Chen et al., 2021) solves offline RL via sequence modeling. Specifically, DT employs a transformer architecture that generates actions given a sequence of historical states and RTGs. To do that, DT first transforms each trajectory in the dataset into a sequence of returns-to-go, states, and actions:

$$\tau = \big( g_1, s_1, a_1, g_2, s_2, a_2, \ldots, g_{|\tau|}, s_{|\tau|}, a_{|\tau|} \big). \tag{1}$$
DT trains a policy that generates action $a_t$ at each timestep $t$ conditioned on the history of RTGs $g_{t-K:t}$, states $s_{t-K:t}$, and actions $a_{t-K:t-1}$, wherein $K$ is the context length of the transformer. The objective is a simple mean square error between the predicted actions and the ground truths:

$$\mathcal{L}_{\mathrm{DT}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}} \, \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big( a_t - \pi_\theta(g_{t-K:t}, s_{t-K:t}, a_{t-K:t-1}) \big)^2. \tag{2}$$
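As a concrete reference for Eq. (2), here is a minimal per-trajectory sketch of the loss (ours, not the authors' code); `policy` is a placeholder callable that maps an RTG/state window and the preceding actions to a predicted action, and in practice the windows would be batched and padded rather than looped over.

```python
import torch
import torch.nn.functional as F

def dt_loss(policy, rtgs, states, actions, K):
    """Mean squared error of Eq. (2) for a single trajectory.

    rtgs:    (T, 1) returns-to-go g_1..g_T
    states:  (T, state_dim)
    actions: (T, act_dim)
    policy:  callable(rtg_window, state_window, past_action_window) -> (act_dim,)
    K:       context length of the transformer
    """
    T = states.shape[0]
    losses = []
    for t in range(T):
        lo = max(0, t - K + 1)  # truncate the history to at most K steps
        pred = policy(rtgs[lo:t + 1], states[lo:t + 1], actions[lo:t])
        losses.append(F.mse_loss(pred, actions[t]))
    return torch.stack(losses).mean()
```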
During evaluation, DT starts with an initial state $s_1$ and a target RTG $g_1$. At each step $t$, the agent generates an action $a_t$, receives a reward $r_t$, and observes the next state $s_{t+1}$. DT updates its RTG as $g_{t+1} = g_t - r_t$ and generates the next action $a_{t+1}$. This process is repeated until the end of the episode.
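The evaluation loop can be summarized by the following sketch (ours; it assumes the classic Gym step API returning a 4-tuple and the same placeholder `policy` signature as above); the key line is the RTG update $g_{t+1} = g_t - r_t$.

```python
def evaluate_dt(env, policy, target_return, K, max_steps=1000):
    """Roll out a return-conditioned policy, decrementing the RTG each step."""
    state = env.reset()
    rtgs, states, actions = [target_return], [state], []
    total_reward = 0.0
    for _ in range(max_steps):
        past_actions = actions[-(K - 1):] if K > 1 else []
        action = policy(rtgs[-K:], states[-K:], past_actions)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        rtgs.append(rtgs[-1] - reward)  # g_{t+1} = g_t - r_t
        states.append(state)
        actions.append(action)
        if done:
            break
    return total_reward
```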
Reinforcement Learning via Supervised Learning (RvS). Emmons et al. (2021) conduct a thorough empirical study of conditional BC methods under the umbrella of Reinforcement Learning via Supervised Learning (RvS), and show that even simple models such as multi-layer perceptrons (MLPs) can perform well. With carefully chosen architectures and hyperparameters, they exhibit performance that matches or exceeds the performance of transformer-based models. There are two main differences between RvS and DT. First, RvS conditions on the average reward $\omega_t$ into the future instead of the sum of future rewards:

$$\omega_t = \frac{1}{H - t + 1} \sum_{t'=t}^{|\tau|} r_{t'} = \frac{g_t}{H - t + 1}, \tag{3}$$

where $H$ is the maximum episode length. Intuitively, $\omega_t$ is the RTG normalized by the number of remaining steps. Second, RvS employs a simple MLP architecture, which generates action $a_t$ at step $t$ based on only the current state $s_t$ and expected outcome $\omega_t$. RvS minimizes a mean square error:

$$\mathcal{L}_{\mathrm{RvS}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}} \, \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big( a_t - \pi_\theta(s_t, \omega_t) \big)^2. \tag{4}$$

At evaluation, RvS is similar to DT, except that the expected outcome is now updated as $\omega_{t+1} = (g_t - r_t)/(H - t)$.
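A minimal sketch of the RvS setup of Eqs. (3)-(4) follows (ours, not the reference implementation; the class name, hidden sizes, and helper function are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RvSPolicy(nn.Module):
    """MLP mapping (state, average reward-to-go omega) to an action, as in Eq. (4)."""

    def __init__(self, state_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, state, omega):
        # state: (B, state_dim), omega: (B, 1)
        return self.net(torch.cat([state, omega], dim=-1))

def average_rtg(rtgs, H):
    """Eq. (3): omega_t = g_t / (H - t + 1) for t = 1..T (1-indexed)."""
    T = rtgs.shape[0]
    t = torch.arange(1, T + 1, dtype=rtgs.dtype)
    return rtgs / (H - t + 1)

# Training step sketch: loss = F.mse_loss(policy(states, omegas.unsqueeze(-1)), actions)
# At evaluation, the conditioning is updated as omega_{t+1} = (g_t - r_t) / (H - t).
```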
3 Probing Unreliability of BC Methods
Our first goal is to identify factors that influence the reliability of return-conditioned RL methods in practice. To this end, we design two illustrative experiments distinguishing reliable and unreliable scenarios.
Illustrative Exp 1 (Data-centric ablation). In our first illustrative experiment, we show a run of RvS and DT on the med-replay, medium, and med-expert datasets of the walker2d environment from the D4RL (Fu et al., 2020) benchmark. Figure 2 shows that reliability (top row) highly depends on the quality of the dataset (bottom row). Similar findings hold for other environments as well. In the medium and med-expert datasets, RvS achieves reliable performance when conditioned on high, out-of-distribution returns, while in the med-replay dataset the performance drops quickly after a certain point. This is because the med-replay dataset has the lowest quality among the three, with most trajectories having low returns, as shown by the second row of Figure 2. This low-quality data does not provide enough signal for the policy to learn to condition on high-value returns, thus negatively affecting the reliability of the model.
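For reference, return distributions like those in the second row of Figure 2 can be computed from the raw datasets roughly as follows (our sketch, assuming the d4rl package and its v2 dataset names; the exact preprocessing used in the paper may differ):

```python
import gym
import numpy as np
import d4rl  # noqa: F401  (registers the D4RL environments with gym)

def episode_returns(name="walker2d-medium-replay-v2"):
    """Split a D4RL dataset into episodes and collect their total returns."""
    data = gym.make(name).get_dataset()
    ends = np.logical_or(data["terminals"], data["timeouts"])
    returns, acc = [], 0.0
    for reward, end in zip(data["rewards"], ends):
        acc += reward
        if end:
            returns.append(acc)
            acc = 0.0
    return np.array(returns)

# A histogram of episode_returns(...) gives the per-dataset return distribution.
```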
Illustrative Exp 2 (Model-centric ablation). Low-quality data is not the only cause of unreliability; the architecture of the model also plays an important role. Figure 2 shows that unlike RvS, DT performs reliably on all three datasets. We hypothesize that the inherent reliability of DT comes from the transformer architecture. As the policy conditions on a sequence of both state tokens and RTG tokens to predict the next action, the attention layers can choose to ignore the ood RTG tokens while still obtaining a good prediction loss. In contrast, RvS employs an MLP architecture that takes both the current state and target return as inputs to generate actions, and thus cannot ignore the return information. To test this hypothesis, we experiment with a slightly modified version of DT, where we concatenate the state and RTG at each timestep instead of treating them as separate tokens. By doing this, the model cannot ignore the RTG information in the sequence. We call this version DT-Concat. Figure 3 shows that the performance of DT-Concat is strongly correlated with the conditioning RTG, and degrades quickly when the target return is out-of-distribution. This result empirically confirms our hypothesis.

Figure 3: Performance of DT when the state and RTG tokens are concatenated. The results are averaged over 10 seeds.
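To make the architectural difference behind the DT-Concat ablation concrete, the sketch below (ours; embedding sizes are placeholders, and action tokens are omitted for brevity) contrasts the two ways of feeding RTGs to the transformer: separate tokens as in DT versus a fused state-RTG token.

```python
import torch
import torch.nn as nn

class TokenBuilder(nn.Module):
    """Build transformer input tokens for DT vs. the DT-Concat ablation."""

    def __init__(self, state_dim, d_model=128, concat=False):
        super().__init__()
        self.concat = concat
        if concat:
            # DT-Concat: one token per timestep; state and RTG pass through the
            # same linear map, so attention cannot drop the RTG on its own.
            self.embed = nn.Linear(state_dim + 1, d_model)
        else:
            # DT: separate RTG and state tokens; attention may learn to
            # down-weight ood RTG tokens while still attending to the states.
            self.embed_rtg = nn.Linear(1, d_model)
            self.embed_state = nn.Linear(state_dim, d_model)

    def forward(self, rtgs, states):
        # rtgs: (B, T, 1), states: (B, T, state_dim)
        if self.concat:
            return self.embed(torch.cat([rtgs, states], dim=-1))       # (B, T, d)
        tokens = torch.stack(
            [self.embed_rtg(rtgs), self.embed_state(states)], dim=2
        )                                                               # (B, T, 2, d)
        return tokens.flatten(1, 2)                                     # (B, 2T, d)
```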
4 Conservative Behavioral Cloning with Trajectory Weighting
We propose ConserWeightive Behavioral Cloning (CWBC), a simple but effective framework for improving the reliability of current BC methods. CWBC consists of two components, namely trajectory weighting and conservative regularization, which tackle the aforementioned issues relating to the observed data distribution and the choice of model architectures, respectively. Trajectory weighting provides a systematic way to transform the suboptimal data distribution to better estimate the optimal distribution by upweighting the high-return trajectories. Moreover, for BC methods such as RvS which use unreliable model parameterizations, we propose a novel conservative loss regularizer that encourages the policy to stay close to the data distribution when conditioned on large, ood returns.
4.1 Trajectory Weighting
To formalize our discussion, recall that $r_\tau$ denotes the return of a trajectory $\tau$, and let $r^* = \sup_\tau r_\tau$ be the maximum expert return, which is assumed to be known in prior works on conditional BC (Chen et al., 2021; Emmons et al., 2021). We know that the optimal offline data distribution, denoted by $\mathcal{T}^*$, is simply the distribution of demonstrations rolled out from the optimal policy. Typically, the offline trajectory distribution $\mathcal{T}$ will be biased w.r.t. $\mathcal{T}^*$. During learning, this leads to a train-test gap, wherein we want to condition our BC agent on the expert returns during evaluation, but are forced to minimize the empirical risk on a biased data distribution during training.
The core idea of our approach is to transform $\mathcal{T}$ into a new distribution $\widetilde{\mathcal{T}}$ that better estimates $\mathcal{T}^*$. More concretely, $\widetilde{\mathcal{T}}$ should concentrate on high-return trajectories, which mitigates the train-test gap. One naive strategy is to simply keep only a small fraction of high-return trajectories from the offline dataset. However, since we expect the original dataset to contain very few high-return trajectories, this will eliminate the majority of training data, leading to poor data efficiency. Instead, we propose to weight the trajectories
based on their returns. Let $f_{\mathcal{T}} : \mathbb{R} \mapsto \mathbb{R}_+$ be the density function of $r_\tau$, where $\tau \sim \mathcal{T}$. We consider the transformed distribution $\widetilde{\mathcal{T}}$ whose density function $p_{\widetilde{\mathcal{T}}}$ is

$$p_{\widetilde{\mathcal{T}}}(\tau) \;\propto\; \underbrace{\frac{f_{\mathcal{T}}(r_\tau)}{f_{\mathcal{T}}(r_\tau) + \lambda} \cdot \exp\left( -\frac{|r_\tau - r^*|}{\kappa} \right)}_{\text{trajectory weight}}, \tag{5}$$
where $\lambda, \kappa \in \mathbb{R}_+$ are two hyperparameters that determine the shape of the transformed distribution. Specifically, $\kappa$ controls how much we want to favor the high-return trajectories, while $\lambda$ controls how close the transformed distribution is to the original distribution. Appendix C.1 provides a detailed analysis of the influence of these two hyperparameters on the transformed distribution.
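In practice, the weight in Eq. (5) can be computed from the empirical returns with a simple density estimate. The sketch below is ours; the histogram-based estimate of $f_{\mathcal{T}}$, the default values of $\lambda$ and $\kappa$, and the function name are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def trajectory_weights(returns, r_star, lam=0.1, kappa=None, n_bins=20):
    """Normalized sampling weights following Eq. (5).

    returns: array of per-trajectory returns r_tau.
    r_star:  target (expert) return used in the exponential term.
    lam:     lambda; larger values keep the weights closer to the data density f_T.
    kappa:   temperature of the exponential; defaults to a fraction of the return
             range so that its scale adapts to the dataset.
    """
    returns = np.asarray(returns, dtype=np.float64)
    if kappa is None:
        kappa = 0.1 * (returns.max() - returns.min() + 1e-8)
    # Histogram-based estimate of the return density f_T(r_tau).
    hist, edges = np.histogram(returns, bins=n_bins, density=True)
    bins = np.clip(np.digitize(returns, edges[1:-1]), 0, n_bins - 1)
    f = hist[bins]
    w = (f / (f + lam)) * np.exp(-np.abs(returns - r_star) / kappa)
    return w / w.sum()

# Training can then sample trajectory indices with
# np.random.choice(len(returns), p=trajectory_weights(returns, r_star)).
```

Consistent with the roles described above, a smaller $\lambda$ pushes the sampling distribution toward the exponential term alone (strongly favoring returns near $r^*$), while a larger $\kappa$ flattens it toward the original data distribution.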
Our trajectory weighting is motivated by a similar scheme proposed for model-based design (Kumar & Levine, 2020), where the authors use it to balance the bias and variance of gradient approximation for surrogates to black-box functions. We derive a similar theoretical result to this work in Appendix G. However, there are also notable differences. In model-based design, the environment is stateless and the dataset consists of $(x, y)$ pairs, whereas in offline RL we have a dataset of trajectories. Therefore, our trajectory weighting reweights the entire trajectories by their returns, as opposed to the original work