
Compared with sampling uniformly from the offline dataset, varying the sampling rate of offline samples to focus on trajectories with higher cumulative returns, i.e., changing the action density, does not change the support and produces a better behavior policy. Matching the learned policy to such a resampled behavior policy therefore approximately performs support alignment.
In this work, we boost offline RL by designing data rebalanc-
ing strategies to construct better behavior policies. We first
show that existing offline datasets are extremely imbalanced
in terms of episodic return (as shown in Fig. 1). In some
datasets, most actions lead to low returns, which suggests that current density-based constraints may be too restrictive. We thus propose to resample the dataset during training based on episodic return, assigning larger weights to transitions with higher returns. The method is thus dubbed Return-based Data Rebalance (ReD) and can be implemented in fewer than 10 lines of code.
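For illustration, one minimal way to implement such return-based resampling is sketched below, assuming per-transition episodic returns and a simple min-max proportional weighting; the function interface and weighting choice are illustrative, and other weighting schemes are possible.

import numpy as np

def resample_indices(episode_returns, batch_size, rng=np.random):
    # episode_returns: array of shape [num_transitions]; each entry is the
    # episodic return of the trajectory that the transition belongs to.
    # Transitions from higher-return trajectories get proportionally larger
    # sampling weights (min-max normalized); the small offset keeps every
    # transition reachable.
    r = np.asarray(episode_returns, dtype=np.float64)
    w = (r - r.min()) / (r.max() - r.min() + 1e-8)
    p = (w + 1e-3) / (w + 1e-3).sum()
    return rng.choice(len(r), size=batch_size, p=p)

# Usage: idx = resample_indices(returns_per_transition, batch_size=256)
#        batch = {k: v[idx] for k, v in dataset.items()}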
Without any modification to prior hyperparameters, we find that ReD effectively boosts the performance of various popular offline RL algorithms by a large margin on diverse domains in D4RL (Brockman et al., 2016; Fu et al., 2020). Then, as a minor contribution, we propose a more elaborate implementation of data rebalance, Decoupled ReD (DeReD), inspired by decoupled training strategies for data rebalancing in long-tailed classification (Kang et al., 2020); a schematic of the two-stage idea follows below. The proposed DeReD combined with IQL achieves state-of-the-art performance on D4RL. The effectiveness of return-based data rebalance may imply that the data dimension is as important as the algorithmic dimension in offline RL.
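For reference, the decoupling recipe of Kang et al. (2020), which first trains with uniform sampling and then continues training under rebalanced sampling, can be transplanted to the return-based setting as in the schematic below; the agent interface and stage lengths are placeholders for illustration rather than the exact DeReD procedure.

def decoupled_training(agent, dataset, uniform_sampler, rebalanced_sampler,
                       stage1_steps=500_000, stage2_steps=500_000):
    # Stage 1: standard training on uniformly sampled minibatches.
    for _ in range(stage1_steps):
        agent.update(uniform_sampler(dataset))
    # Stage 2: continue training on return-rebalanced minibatches,
    # analogous to classifier re-training in Kang et al. (2020).
    for _ in range(stage2_steps):
        agent.update(rebalanced_sampler(dataset))
    return agent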
Figure 1. Visualization of trajectory return distributions for hopper-medium-replay-v2, walker2d-medium-replay-v2, hopper-medium-expert-v2, and walker2d-medium-expert-v2. Medium-replay datasets tend to have a long-tailed return distribution, while medium-expert datasets tend to have two peaks.
2. Related Work
Offline RL.
To alleviate extrapolation error and address
the distributional shift problem, a general framework for
prior offline RL works is to constrain the learned policy
to stay close to the behavior policy. Because the KL divergence is easy to compute accurately under a Gaussian distribution assumption, many works choose it as the policy constraint. There are many concrete implementation choices, e.g., explicitly modeling the behavior prior with a VAE, or avoiding explicit modeling via the dual form (Wu et al., 2019; Jaques et al., 2019). Exponentially advantage-weighted regression, an implicit form of the KL-divergence constraint, is derived in AWR (Peng et al., 2019), CRR (Wang et al., 2020), and AWAC (Nair et al., 2020). IQL (Kostrikov et al., 2021b) also extracts the policy via advantage-weighted regression from an expectile value function, thereby enforcing a KL constraint.
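Concretely, these methods extract the policy by maximizing an exponentially advantage-weighted log-likelihood over the dataset (notation ours):

$$\max_{\pi}\;\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\exp\!\big(A(s,a)/\lambda\big)\,\log\pi(a\mid s)\Big],$$

which arises from KL-constrained policy improvement: the optimal policy satisfies $\pi^{*}(a\mid s)\propto\mu(a\mid s)\exp\big(A(s,a)/\lambda\big)$ for behavior policy $\mu$ and temperature $\lambda>0$, and projecting $\pi^{*}$ onto the parametric policy class yields the weighted regression above.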
Behavior cloning (BC) is another way to implement the constraint (Fujimoto & Gu, 2021). Other directions for offline RL also exist. One line regularizes the Q-function with conservative estimates (Kumar et al., 2020; Buckman et al., 2020). Surprisingly, our experiments show that data rebalance also works with conservative Q-learning (Kumar et al., 2020). Another line views offline RL as a sequence modeling problem handled by masked transformers (Chen et al., 2021; Janner et al., 2021), where the transformer outputs actions to attain a given target return.
Some works attempt to approximately satisfy support alignment. BEAR (Kumar et al., 2019) utilizes the maximum mean discrepancy (MMD) to approximately optimize support alignment.
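As a reference point, a sampled estimate of the squared MMD between policy actions and behavior actions with a Gaussian kernel can be written as below; BEAR's actual kernel choice, bandwidth, and the dual-gradient machinery around the constraint are not shown here, and the bandwidth value is an assumption.

import torch

def mmd_squared(policy_actions, behavior_actions, sigma=10.0):
    # policy_actions: [n, d] actions sampled from the learned policy;
    # behavior_actions: [m, d] actions sampled from the dataset (behavior policy).
    # sigma is an assumed Gaussian-kernel bandwidth.
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2.0 * sigma))
    return (kernel(policy_actions, policy_actions).mean()
            + kernel(behavior_actions, behavior_actions).mean()
            - 2.0 * kernel(policy_actions, behavior_actions).mean())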
However, the effectiveness of the sampled MMD in constraining two distributions to the same support has only been shown empirically for low-dimensional distributions with diagonal covariance matrices, without theoretical guarantees, and MMD is quite complex to implement. This is a possible reason why Wu et al. (2019) find that MMD brings no gain over KL. Another attempt to relax the restrictive constraint is to adaptively adjust the weight of the constraint term by dual gradient ascent. However, Wu et al. (2019) observe that the adaptive weight is slightly worse than a fixed one.
Data Rebalance.
Dataset rebalancing is widely used in visual tasks that face long-tailed distributions (Zhang et al., 2021). For decision making, imitation learning (IL) aims to learn from demonstrations, where rebalancing is naturally applied to filter out bad demonstrations. BAIL (Chen et al., 2020) employs a neural network to approximate the upper envelope (i.e., the optimal return from data) and selects good state-action pairs to imitate. Another form of rebalancing in IL is 10%BC, where behavior cloning uses only the top 10% of transitions ordered by episodic return (Chen et al., 2021).
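A minimal sketch of this kind of filtering, assuming a per-transition array of episodic returns (the array name and dataset layout are illustrative):

import numpy as np

def top_return_mask(episode_returns, fraction=0.10):
    # Keep only transitions whose trajectory return is at or above the
    # (1 - fraction) quantile of the dataset, e.g., the top 10%.
    r = np.asarray(episode_returns)
    return r >= np.quantile(r, 1.0 - fraction)

# Usage: mask = top_return_mask(returns_per_transition)
#        bc_dataset = {k: v[mask] for k, v in dataset.items()}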
MARWIL (Wang et al., 2018) and AWR (Peng et al., 2019) employ exponentially advantage-weighted behavior cloning, which is equivalent to the policy improvement step with a KL constraint in RL. Our experiments show that our method can further improve such KL-constrained methods.