
Reliable Conditioning of Behavioral Cloning
for Offline Reinforcement Learning
Tung Nguyen 1, Qinqing Zheng 2, Aditya Grover 1
1 UCLA   2 Meta AI Research. Correspondence to: Tung Nguyen <tungnd@cs.ucla.edu>.
Abstract
Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline
trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021;
Emmons et al., 2021) have shown that, by conditioning on desired future returns, BC can perform
competitively with its value-based counterparts, while enjoying much greater simplicity and
training stability. While promising, we show that these methods can be unreliable, as their
performance may degrade significantly when conditioned on high, out-of-distribution (ood)
returns. This is crucial in practice, as we often expect the policy to perform better than the
offline dataset by conditioning on an ood value. We show that this unreliability arises from
both the suboptimality of the training data and the choice of model architecture. We propose
ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the
reliability of conditional BC with two key components: trajectory weighting and conservative
regularization. Trajectory weighting upweights high-return trajectories to reduce the
train-test gap for BC methods, while the conservative regularizer encourages the policy to
stay close to the data distribution under ood conditioning. We study CWBC in the context of
RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC
significantly boosts their performance on various benchmarks.
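As a rough preview of the two components above, the sketch below illustrates one plausible way to realize them: a softmax weighting over trajectory returns so that high-return trajectories are sampled more often, and a conservative perturbation that conditions top trajectories on returns inflated beyond the dataset maximum while keeping the dataset actions as regression targets. This is a simplified illustration under stated assumptions (the softmax form, temperature, quantile threshold, and noise scale are ours), not the exact CWBC objective.

```python
import numpy as np

def trajectory_weights(returns, temperature=1.0):
    """Softmax weighting over normalized trajectory returns, so that
    high-return trajectories are sampled more often during BC training.
    The softmax form and temperature are illustrative assumptions."""
    returns = np.asarray(returns, dtype=np.float64)
    r = (returns - returns.min()) / (returns.max() - returns.min() + 1e-8)
    w = np.exp(r / temperature)
    return w / w.sum()

def conservatively_perturbed_returns(returns, noise_scale=0.1, quantile=0.9):
    """For trajectories near the top of the return distribution, inflate the
    conditioning return beyond the dataset maximum while the BC loss still
    regresses to the dataset actions, encouraging the policy to stay close
    to the data under ood conditioning. Quantile and noise scale are
    illustrative assumptions."""
    returns = np.asarray(returns, dtype=np.float64)
    perturbed = returns.copy()
    high = returns >= np.quantile(returns, quantile)
    noise = np.abs(np.random.randn(high.sum())) * noise_scale * returns.max()
    perturbed[high] = returns.max() + noise
    return perturbed
```

In both cases the supervised BC loss itself is unchanged; only the sampling distribution over trajectories and the conditioning values fed to the policy differ.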
1 Introduction
In many real-world applications such as education, healthcare, and autonomous driving,
collecting data via online interactions is expensive or even dangerous. However, we
often have access to logged datasets in these domains that have been collected
previously by some unknown policies.
The goal of offline reinforcement learning (RL) is to directly
learn effective agent policies from such datasets, without
additional online interactions (Lange et al., 2012; Levine
et al., 2020). Many online RL algorithms have been adapted
to work in the offline setting, including value-based methods
(Fujimoto et al., 2019; Ghasemipour et al., 2021; Wu et al.,
2019; Jaques et al., 2019; Kumar et al., 2020; Fujimoto &
Gu, 2021; Kostrikov et al., 2021a) as well as model-based
methods (Yu et al., 2020; Kidambi et al., 2020). The key
challenge in all these methods is to generalize the learned value
functions or dynamics models to state-action pairs outside the offline dataset.
An alternative way to tackle offline RL is via methods
derived from behavioral cloning (BC) (Bain & Sammut,
1995). BC is a supervised learning technique that was
initially developed for imitation learning, where the goal is
to learn a policy that mimics expert demonstrations.
Recently, a number of works have proposed to formulate offline RL
as a supervised learning problem (Chen et al., 2021;
Janner et al., 2021; Emmons et al., 2021). Since offline RL
datasets usually do not contain expert demonstrations, these
works condition BC on extra context information that specifies
target outcomes such as returns and goals. The empirical evidence
shows that these conditional BC approaches perform competitively
with value-based approaches, while additionally enjoying the enhanced
simplicity and training stability of supervised learning.
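Concretely, conditional BC reduces policy learning to regression: the policy receives the current state together with a conditioning variable (e.g., the return-to-go) and is trained to reproduce the action taken in the dataset. The sketch below shows this training objective for an RvS-style MLP policy; the network sizes, tensor shapes, and function names are illustrative assumptions rather than the exact setup of Emmons et al. (2021).

```python
import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    """Illustrative RvS-style policy: an MLP over the concatenation of the
    state and a scalar target return. Hidden sizes are assumptions."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, target_return):
        # Append the conditioning variable to the state before the MLP.
        return self.net(torch.cat([state, target_return], dim=-1))

def conditional_bc_loss(policy, states, actions, returns_to_go):
    # Plain behavioral cloning objective with return conditioning:
    # regress the dataset action from (state, return-to-go).
    pred = policy(states, returns_to_go)
    return ((pred - actions) ** 2).mean()
```

During training, the returns-to-go are computed from the offline trajectories themselves; the conditioning only becomes out-of-distribution at test time, when a higher target return is requested.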
As the maximum return in the offline trajectories is often
far below the desired expert return, we expect the policy
to extrapolate beyond the offline data by conditioning on
out-of-distribution (ood) expert returns. In an ideal world, the
policy would achieve the desired outcomes, even when they
are unseen during training. This corresponds to Figure 1a,
where the relationship between the achieved and target
returns forms a straight line. In reality, however, the
performance of current methods is far from ideal. Specifically,
the actual performance closely follows the target return and
peaks near the maximum return in the dataset, but
drops sharply when conditioned on a return beyond that point.
Figure 1b illustrates this problem.
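The reliability curves in Figure 1 come from sweeping the conditioning value: the trained policy is rolled out at a range of target returns, from well within the dataset's return range to well beyond its maximum, and the achieved return is recorded. A minimal sketch of such a sweep is given below, assuming a Gymnasium-style environment and the hypothetical ReturnConditionedPolicy from the earlier sketch; the decrement-by-reward conditioning mirrors Decision Transformer-style evaluation.

```python
import numpy as np
import torch

@torch.no_grad()
def rollout_at_target(policy, env, target_return, max_steps=1000):
    """Roll out a return-conditioned policy for one episode, decrementing
    the conditioning value by the observed reward at each step."""
    state, _ = env.reset()
    remaining = float(target_return)
    achieved = 0.0
    for _ in range(max_steps):
        obs = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        cond = torch.tensor([[remaining]], dtype=torch.float32)
        action = policy(obs, cond).squeeze(0).numpy()
        state, reward, terminated, truncated, _ = env.step(action)
        achieved += reward
        remaining -= reward
        if terminated or truncated:
            break
    return achieved

def reliability_sweep(policy, env, dataset_max_return, num_points=10):
    # Target returns range from in-distribution values up to twice the
    # best return in the offline dataset (an ood conditioning regime).
    targets = np.linspace(0.5 * dataset_max_return,
                          2.0 * dataset_max_return, num_points)
    return [(t, rollout_at_target(policy, env, t)) for t in targets]
```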
We systematically analyze the unreliability of current methods,
and show that it depends on both the quality of the offline
data and the architecture of the return-conditioned policy.