Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning

Tung Nguyen 1, Qinqing Zheng 2, Aditya Grover 1
1 UCLA, 2 Meta AI Research. Correspondence to: Tung Nguyen <tungnd@cs.ucla.edu>.
Abstract
Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts, while enjoying much more simplicity and training stability. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (ood) returns. This is crucial in practice, as we often expect the policy to perform better than the offline dataset by conditioning on an ood value. We show that this unreliability arises from both the suboptimality of training data and model architectures. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights the high-return trajectories to reduce the train-test gap for BC methods, while the conservative regularizer encourages the policy to stay close to the data distribution for ood conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.
1 Introduction
In many real-world applications such as education, healthcare, and autonomous driving, collecting data via online interactions is expensive or even dangerous. However, we often have access to logged datasets in these domains that have been collected previously by some unknown policies.
The goal of offline reinforcement learning (RL) is to directly learn effective agent policies from such datasets, without additional online interactions (Lange et al., 2012; Levine et al., 2020). Many online RL algorithms have been adapted to work in the offline setting, including value-based methods (Fujimoto et al., 2019; Ghasemipour et al., 2021; Wu et al., 2019; Jaques et al., 2019; Kumar et al., 2020; Fujimoto & Gu, 2021; Kostrikov et al., 2021a) as well as model-based methods (Yu et al., 2020; Kidambi et al., 2020). The key challenge in all these methods is to generalize the value or dynamics to state-action pairs outside the offline dataset.
An alternative way to approach offline RL is via approaches derived from behavioral cloning (BC) (Bain & Sammut, 1995). BC is a supervised learning technique that was initially developed for imitation learning, where the goal is to learn a policy that mimics expert demonstrations. Recently, a number of works propose to formulate offline RL as a supervised learning problem (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). Since offline RL datasets usually do not have expert demonstrations, these works condition BC on extra context information to specify target outcomes such as returns and goals. Compared with the value-based approaches, the empirical evidence has shown that these conditional BC approaches perform competitively, and they additionally enjoy the enhanced simplicity and training stability of supervised learning.
As the maximum return in the offline trajectories is often far below the desired expert returns, we expect the policy to extrapolate over the offline data by conditioning on out-of-distribution (ood) expert returns. In an ideal world, the policy will achieve the desired outcomes, even when they are unseen during training. This corresponds to Figure 1a, where the relationship between the achieved and target returns forms a straight line. In reality, however, the performance of current methods is far from ideal. Specifically, the actual performance closely follows the target return and peaks at a point near the maximum return in the dataset, but drops sharply if conditioned on a return beyond that point. Figure 1b illustrates this problem.
We systematically analyze the unreliability of current methods, and show that it depends on both the quality of offline data and the architecture of the return-conditioned policy.
Figure 1 (panels: (a) Ideal, (b) Unreliable, (c) Reliable): Illustrative figures demonstrating three hypothetical scenarios for conditioning of BC methods for offline RL. The green line shows the maximum return in the offline dataset, while the orange line shows the expert return. The ideal scenario (a) is hard or even impossible to achieve with suboptimal offline data. On the other hand, return-conditioned RL methods can show unreliable generalization (b), where the performance drops quickly after a certain point in the vicinity of the dataset maximum. Our goal is to ensure reliable generalization (c) even when conditioned on ood returns.
For the former, we observe that offline datasets are generally suboptimal and, even in the range of observed returns, the distribution is highly non-uniform and concentrated over trajectories with low returns. This affects reliability, as we are mostly concerned with conditioning the policy on returns near or above the observed maximum in the offline dataset. One trivial solution to this problem is to simply filter out the low-return trajectories prior to learning. However, this is not always viable, as filtering can eliminate a good fraction of the offline trajectories, leading to poor data efficiency.
On the architecture aspect, we find that existing BC methods have significantly different behaviors when conditioning on ood returns. While DT (Chen et al., 2021) generalizes to ood returns reliably, RvS (Emmons et al., 2021) is highly sensitive to such ood conditioning and exhibits vast drops in peak performance for such ood inputs. Therefore, the current practice for setting the conditioning return at test time in RvS is based on careful tuning with online rollouts, which is often tedious, impractical, and inconsistent with the promise of offline RL to minimize online interactions.
While the idealized scenario in Figure 1a is hard to achieve or even impossible depending on the training dataset and environment (Wang et al., 2020; Zanette, 2021; Foster et al., 2021), the unreliability of these methods is a major barrier for high-stakes deployments. Hence, we focus this work on improving the reliability of return-conditioned offline RL methods. Figure 1c illustrates this goal, where conditioning beyond the dataset maximum return does not degrade the model performance, even if the achieved returns do not match the target conditioning. To this end, we propose ConserWeightive Behavioral Cloning (CWBC), which consists of two key components: trajectory weighting and conservative regularization. Trajectory weighting assigns and adjusts weights to each trajectory during training and prioritizes high-return trajectories for improved reliability. Next, we introduce a notion of conservatism for ood-sensitive BC methods such as RvS, which encourages the policy to stay close to the observed state-action distribution when conditioning on high returns. We achieve conservatism by selectively perturbing the returns of the high-return trajectories with a novel noise model and projecting the predicted actions to the ones observed in the unperturbed trajectory.
Our proposed framework is simple and easy to implement. Empirically, we instantiate CWBC in the context of RvS (Emmons et al., 2021) and DT (Chen et al., 2021), two state-of-the-art BC methods for offline RL. CWBC significantly improves the performance of RvS and DT in D4RL (Fu et al., 2020) locomotion tasks by 18% and 8%, respectively, without any hand-picking of the value of the conditioning returns at test time.
2 Preliminaries
We model our environment as a Markov decision process (MDP) (Bellman, 1957), which can be described by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_1)$ is the distribution of the initial state, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $R(s_t, a_t)$ is the deterministic reward function, and $\gamma$ is the discount factor. At each timestep $t$, the agent observes a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. This moves the agent to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and provides the agent with a reward $r_t = R(s_t, a_t)$.
Offline RL. We are interested in learning a (near-)optimal policy from a static offline dataset of trajectories collected by unknown policies, denoted as $\mathcal{T}_{\text{offline}}$. We assume that these trajectories are i.i.d. samples drawn from some unknown static distribution $\mathcal{T}$. We use $\tau$ to denote a trajectory and $|\tau|$ to denote its length. Following Chen et al. (2021), the return-to-go (RTG) for a trajectory $\tau$ at timestep $t$ is defined as the sum of rewards starting from $t$ until the end of the trajectory: $g_t = \sum_{t'=t}^{|\tau|} r_{t'}$. This means the initial RTG $g_1$ is equal to the total return of the trajectory, $r_\tau = \sum_{t=1}^{|\tau|} r_t$.
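To make the return-to-go concrete, the short sketch below (ours, not from the paper) computes $g_t$ for every timestep of one trajectory from its reward sequence; the function name and the NumPy-array input format are illustrative assumptions.

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """Compute g_t = sum_{t'=t}^{|tau|} r_{t'} for every timestep t.

    rewards: shape (T,), the per-step rewards of a single trajectory.
    Returns an array of the same shape; element 0 equals the total return r_tau.
    """
    # Reverse the rewards, take a cumulative sum, then reverse back.
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards [1, 0, 2] give RTGs [3, 2, 2]; g_1 = 3 is the trajectory return.
print(returns_to_go(np.array([1.0, 0.0, 2.0])))
```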
Figure 2: Reliability of RvS and DT on different walker2d datasets. The first row shows the performance of the two methods, and the second row shows the return distribution of each dataset. Reliability decreases as the data quality decreases from med-expert to med-replay. While DT performs reliably, RvS exhibits vast drops in performance.

Decision Transformer (DT). DT (Chen et al., 2021) solves offline RL via sequence modeling. Specifically, DT employs a transformer architecture that generates actions given a sequence of historical states and RTGs. To do that, DT first transforms each trajectory in the dataset into a sequence of returns-to-go, states, and actions:

$$\tau = \big( g_1, s_1, a_1, g_2, s_2, a_2, \ldots, g_{|\tau|}, s_{|\tau|}, a_{|\tau|} \big). \tag{1}$$
DT trains a policy that generates action $a_t$ at each timestep $t$ conditioned on the history of RTGs $g_{t-K:t}$, states $s_{t-K:t}$, and actions $a_{t-K:t-1}$, wherein $K$ is the context length of the transformer. The objective is a simple mean square error between the predicted actions and the ground truths:

$$\mathcal{L}_{\mathrm{DT}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}} \, \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big( a_t - \pi_\theta(g_{t-K:t}, s_{t-K:t}, a_{t-K:t-1}) \big)^2. \tag{2}$$
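As a concrete reference for Eq. (2), here is a minimal per-trajectory sketch of the loss (ours, not the authors' code); `policy` is a placeholder callable that maps an RTG/state window and the preceding actions to a predicted action, and in practice the windows would be batched and padded rather than looped over.

```python
import torch
import torch.nn.functional as F

def dt_loss(policy, rtgs, states, actions, K):
    """Mean squared error of Eq. (2) for a single trajectory.

    rtgs:    (T, 1) returns-to-go g_1..g_T
    states:  (T, state_dim)
    actions: (T, act_dim)
    policy:  callable(rtg_window, state_window, past_action_window) -> (act_dim,)
    K:       context length of the transformer
    """
    T = states.shape[0]
    losses = []
    for t in range(T):
        lo = max(0, t - K + 1)  # truncate the history to at most K steps
        pred = policy(rtgs[lo:t + 1], states[lo:t + 1], actions[lo:t])
        losses.append(F.mse_loss(pred, actions[t]))
    return torch.stack(losses).mean()
```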
During evaluation, DT starts with an initial state $s_1$ and a target RTG $g_1$. At each step $t$, the agent generates an action $a_t$, receives a reward $r_t$, and observes the next state $s_{t+1}$. DT updates its RTG as $g_{t+1} = g_t - r_t$ and generates the next action $a_{t+1}$. This process is repeated until the end of the episode.
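The evaluation loop can be summarized by the following sketch (ours; it assumes the classic Gym step API returning a 4-tuple and the same placeholder `policy` signature as above); the key line is the RTG update $g_{t+1} = g_t - r_t$.

```python
def evaluate_dt(env, policy, target_return, K, max_steps=1000):
    """Roll out a return-conditioned policy, decrementing the RTG each step."""
    state = env.reset()
    rtgs, states, actions = [target_return], [state], []
    total_reward = 0.0
    for _ in range(max_steps):
        past_actions = actions[-(K - 1):] if K > 1 else []
        action = policy(rtgs[-K:], states[-K:], past_actions)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        rtgs.append(rtgs[-1] - reward)  # g_{t+1} = g_t - r_t
        states.append(state)
        actions.append(action)
        if done:
            break
    return total_reward
```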
Reinforcement Learning via Supervised Learning (RvS). Emmons et al. (2021) conduct a thorough empirical study of conditional BC methods under the umbrella of Reinforcement Learning via Supervised Learning (RvS), and show that even simple models such as multi-layer perceptrons (MLPs) can perform well. With carefully chosen architectures and hyperparameters, they exhibit performance that matches or exceeds the performance of transformer-based models. There are two main differences between RvS and DT. First, RvS conditions on the average reward $\omega_t$ into the future instead of the sum of future rewards:

$$\omega_t = \frac{1}{H - t + 1} \sum_{t'=t}^{|\tau|} r_{t'} = \frac{g_t}{H - t + 1}, \tag{3}$$

where $H$ is the maximum episode length. Intuitively, $\omega_t$ is the RTG normalized by the number of remaining steps. Second, RvS employs a simple MLP architecture, which generates action $a_t$ at step $t$ based on only the current state $s_t$ and expected outcome $\omega_t$. RvS minimizes a mean square error:

$$\mathcal{L}_{\mathrm{RvS}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}} \, \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big( a_t - \pi_\theta(s_t, \omega_t) \big)^2. \tag{4}$$

At evaluation, RvS is similar to DT, except that the expected outcome is now updated as $\omega_{t+1} = (g_t - r_t)/(H - t)$.
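A minimal sketch of the RvS setup of Eqs. (3)-(4) follows (ours, not the reference implementation; the class name, hidden sizes, and helper function are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RvSPolicy(nn.Module):
    """MLP mapping (state, average reward-to-go omega) to an action, as in Eq. (4)."""

    def __init__(self, state_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, state, omega):
        # state: (B, state_dim), omega: (B, 1)
        return self.net(torch.cat([state, omega], dim=-1))

def average_rtg(rtgs, H):
    """Eq. (3): omega_t = g_t / (H - t + 1) for t = 1..T (1-indexed)."""
    T = rtgs.shape[0]
    t = torch.arange(1, T + 1, dtype=rtgs.dtype)
    return rtgs / (H - t + 1)

# Training step sketch: loss = F.mse_loss(policy(states, omegas.unsqueeze(-1)), actions)
# At evaluation, the conditioning is updated as omega_{t+1} = (g_t - r_t) / (H - t).
```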
3 Probing Unreliability of BC Methods
Our first goal is to identify factors that influence the reliability of return-conditioned RL methods in practice. To this end, we design two illustrative experiments distinguishing reliable and unreliable scenarios.
Illustrative Exp 1 (Data-centric ablation). In our first illustrative experiment, we show a run of RvS and DT on the med-replay, medium, and med-expert datasets of the walker2d environment from the D4RL (Fu et al., 2020) benchmark. Figure 2 shows that reliability (top row) highly depends on the quality of the dataset (bottom row). Similar findings hold for other environments as well. In the medium and med-expert datasets, RvS achieves reliable performance when conditioned on high, out-of-distribution returns, while in the med-replay dataset the performance drops quickly after a certain point. This is because the med-replay dataset has the lowest quality among the three, with most trajectories having low returns, as shown by the second row of Figure 2. This low-quality data does not provide enough signal for the policy to learn to condition on high-value returns, thus negatively affecting the reliability of the model.
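For reference, return distributions like those in the second row of Figure 2 can be computed from the raw datasets roughly as follows (our sketch, assuming the d4rl package and its v2 dataset names; the exact preprocessing used in the paper may differ):

```python
import gym
import numpy as np
import d4rl  # noqa: F401  (registers the D4RL environments with gym)

def episode_returns(name="walker2d-medium-replay-v2"):
    """Split a D4RL dataset into episodes and collect their total returns."""
    data = gym.make(name).get_dataset()
    ends = np.logical_or(data["terminals"], data["timeouts"])
    returns, acc = [], 0.0
    for reward, end in zip(data["rewards"], ends):
        acc += reward
        if end:
            returns.append(acc)
            acc = 0.0
    return np.array(returns)

# A histogram of episode_returns(...) gives the per-dataset return distribution.
```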
Illustrative Exp 2 (Model-centric ablation). Low-quality data is not the only cause of unreliability; the architecture of the model also plays an important role. Figure 2 shows that unlike RvS, DT performs reliably on all three datasets. We hypothesize that the inherent reliability of DT comes from the transformer architecture. As the policy conditions on a sequence of both state tokens and RTG tokens to predict the next action, the attention layers can choose to ignore the ood RTG tokens while still obtaining a good prediction loss. In contrast, RvS employs an MLP architecture that takes both the current state and target return as inputs to generate actions, and thus cannot ignore the return information. To test this hypothesis, we experiment with a slightly modified version of DT, where we concatenate the state and RTG at each timestep instead of treating them as separate tokens. By doing this, the model cannot ignore the RTG information in the sequence. We call this version DT-Concat. Figure 3 shows that the performance of DT-Concat is strongly correlated with the conditioning RTG, and degrades quickly when the target return is out-of-distribution. This result empirically confirms our hypothesis.

Figure 3: Performance of DT when the state and RTG tokens are concatenated. The results are averaged over 10 seeds.
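To make the architectural difference behind the DT-Concat ablation concrete, the sketch below (ours; embedding sizes are placeholders, and action tokens are omitted for brevity) contrasts the two ways of feeding RTGs to the transformer: separate tokens as in DT versus a fused state-RTG token.

```python
import torch
import torch.nn as nn

class TokenBuilder(nn.Module):
    """Build transformer input tokens for DT vs. the DT-Concat ablation."""

    def __init__(self, state_dim, d_model=128, concat=False):
        super().__init__()
        self.concat = concat
        if concat:
            # DT-Concat: one token per timestep; state and RTG pass through the
            # same linear map, so attention cannot drop the RTG on its own.
            self.embed = nn.Linear(state_dim + 1, d_model)
        else:
            # DT: separate RTG and state tokens; attention may learn to
            # down-weight ood RTG tokens while still attending to the states.
            self.embed_rtg = nn.Linear(1, d_model)
            self.embed_state = nn.Linear(state_dim, d_model)

    def forward(self, rtgs, states):
        # rtgs: (B, T, 1), states: (B, T, state_dim)
        if self.concat:
            return self.embed(torch.cat([rtgs, states], dim=-1))       # (B, T, d)
        tokens = torch.stack(
            [self.embed_rtg(rtgs), self.embed_state(states)], dim=2
        )                                                               # (B, T, 2, d)
        return tokens.flatten(1, 2)                                     # (B, 2T, d)
```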
4 Conservative Behavioral Cloning with Trajectory Weighting
We propose ConserWeightive Behavioral Cloning (CWBC), a simple but effective framework for improving the reliability of current BC methods. CWBC consists of two components, namely trajectory weighting and conservative regularization, which tackle the aforementioned issues relating to the observed data distribution and the choice of model architectures, respectively. Trajectory weighting provides a systematic way to transform the suboptimal data distribution to better estimate the optimal distribution by upweighting the high-return trajectories. Moreover, for BC methods such as RvS which use unreliable model parameterizations, we propose a novel conservative loss regularizer that encourages the policy to stay close to the data distribution when conditioned on large, ood returns.
4.1 Trajectory Weighting
To formalize our discussion, recall that $r_\tau$ denotes the return of a trajectory $\tau$, and let $r^* = \sup_\tau r_\tau$ be the maximum expert return, which is assumed to be known in prior works on conditional BC (Chen et al., 2021; Emmons et al., 2021). We know that the optimal offline data distribution, denoted by $\mathcal{T}^*$, is simply the distribution of demonstrations rolled out from the optimal policy. Typically, the offline trajectory distribution $\mathcal{T}$ will be biased w.r.t. $\mathcal{T}^*$. During learning, this leads to a train-test gap, wherein we want to condition our BC agent on the expert returns during evaluation, but are forced to minimize the empirical risk on a biased data distribution during training.
The core idea of our approach is to transform $\mathcal{T}$ into a new distribution $\widetilde{\mathcal{T}}$ that better estimates $\mathcal{T}^*$. More concretely, $\widetilde{\mathcal{T}}$ should concentrate on high-return trajectories, which mitigates the train-test gap. One naive strategy is to simply keep only a small fraction of high-return trajectories from the offline dataset. However, since we expect the original dataset to contain very few high-return trajectories, this will eliminate the majority of training data, leading to poor data efficiency. Instead, we propose to weight the trajectories
based on their returns. Let $f_{\mathcal{T}} : \mathbb{R} \mapsto \mathbb{R}_+$ be the density function of $r_\tau$, where $\tau \sim \mathcal{T}$. We consider the transformed distribution $\widetilde{\mathcal{T}}$ whose density function $p_{\widetilde{\mathcal{T}}}$ is

$$p_{\widetilde{\mathcal{T}}}(\tau) \;\propto\; \underbrace{\frac{f_{\mathcal{T}}(r_\tau)}{f_{\mathcal{T}}(r_\tau) + \lambda} \cdot \exp\left( -\frac{|r_\tau - r^*|}{\kappa} \right)}_{\text{trajectory weight}}, \tag{5}$$
where $\lambda, \kappa \in \mathbb{R}_+$ are two hyperparameters that determine the shape of the transformed distribution. Specifically, $\kappa$ controls how much we want to favor the high-return trajectories, while $\lambda$ controls how close the transformed distribution is to the original distribution. Appendix C.1 provides a detailed analysis of the influence of these two hyperparameters on the transformed distribution.
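In practice, the weight in Eq. (5) can be computed from the empirical returns with a simple density estimate. The sketch below is ours; the histogram-based estimate of $f_{\mathcal{T}}$, the default values of $\lambda$ and $\kappa$, and the function name are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def trajectory_weights(returns, r_star, lam=0.1, kappa=None, n_bins=20):
    """Normalized sampling weights following Eq. (5).

    returns: array of per-trajectory returns r_tau.
    r_star:  target (expert) return used in the exponential term.
    lam:     lambda; larger values keep the weights closer to the data density f_T.
    kappa:   temperature of the exponential; defaults to a fraction of the return
             range so that its scale adapts to the dataset.
    """
    returns = np.asarray(returns, dtype=np.float64)
    if kappa is None:
        kappa = 0.1 * (returns.max() - returns.min() + 1e-8)
    # Histogram-based estimate of the return density f_T(r_tau).
    hist, edges = np.histogram(returns, bins=n_bins, density=True)
    bins = np.clip(np.digitize(returns, edges[1:-1]), 0, n_bins - 1)
    f = hist[bins]
    w = (f / (f + lam)) * np.exp(-np.abs(returns - r_star) / kappa)
    return w / w.sum()

# Training can then sample trajectory indices with
# np.random.choice(len(returns), p=trajectory_weights(returns, r_star)).
```

Consistent with the roles described above, a smaller $\lambda$ pushes the sampling distribution toward the exponential term alone (strongly favoring returns near $r^*$), while a larger $\kappa$ flattens it toward the original data distribution.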
Our trajectory weighting is motivated by a similar scheme proposed for model-based design (Kumar & Levine, 2020), where the authors use it to balance the bias and variance of gradient approximation for surrogates to black-box functions. We derive a similar theoretical result to this work in Appendix G. However, there are also notable differences. In model-based design, the environment is stateless and the dataset consists of $(x, y)$ pairs, whereas in offline RL we have a dataset of trajectories. Therefore, our trajectory weighting reweights the entire trajectories by their returns, as opposed to the original work