a policy trained using the same algorithm [1], [12]. In actor-
critic methods, this is due to extrapolation error of the critic
network on out-of-distribution state-action pairs [14]. Offline
RL methods address this by constraining the learned policy to stay close to the behavior policy that collected the offline dataset.
BRAC [22] achieves this by minimizing the Kullback-Leibler
divergence between the behavior policy and the learned policy.
BEAR [12] minimizes the maximum mean discrepancy (MMD) between the two policies. TD3+BC [4] proposes a simple yet effective offline RL algorithm by adding a behavior cloning loss to the actor update (see the sketch at the end of this paragraph). Another class of offline RL methods learns conservative Q functions, which prevents the policy network from exploiting out-of-distribution actions and forces it
to stay close to the behavior policy. CQL [2] changes the
critic objective to also minimize the Q function on unseen
actions. Fisher-BRC [3] achieves conservative Q learning by
constraining the gradient of the Q function on unseen data.
Model-based offline RL methods [23], [24] train policies based
on the data generated by ensembles of dynamics models
learned from offline data, while constraining the policy to
stay within samples where the dynamics model is certain. In
this paper, we focus on offline-to-online RL with the goal
of stable and sample-efficient online fine-tuning from policies
pre-trained on offline datasets of different quality.
Offline pre-training in RL. Pre-training has been extensively studied in the machine learning community, from computer vision [25]–[27] to natural language processing [28],
[29]. Offline pre-training in RL could enable deployment
of RL methods in domains where data collection can be
expensive or dangerous. [30]–[32] pre-train the policy network
with imitation learning to speed up RL. QT-Opt [33] studies vision-based object manipulation using a large and diverse dataset collected by seven robots over several months and fine-tunes the policy with 27K samples of online data. However, these methods pre-train on large, diverse, or expert datasets, and it is also important to investigate pre-training from offline datasets of different quality. [34], [35]
use offline pre-training to accelerate downstream tasks. AWAC
[7] and Balanced Replay [36] are recent works that also
focus on offline-to-online RL from datasets of different quality.
AWAC updates the policy network so that it is constrained during offline training without being overly conservative during fine-tuning. Balanced Replay trains an additional neural network
to prioritize samples in order to effectively use new data
as well as near-on-policy samples in the offline dataset. We compare against AWAC and Balanced Replay and attain state-of-the-art offline-to-online RL performance on the popular D4RL benchmark.
Ensembles in RL. Ensemble methods are widely used for
better performance in RL [37]–[40]. In model-based RL, PETS
[39] and MBPO [40] use probabilistic ensembles to effectively
model the dynamics of the environment. In model-free RL,
ensembles of Q functions have been shown to improve performance [41], [42]. REDQ [9] learns a randomized ensemble of Q functions to achieve sample efficiency comparable to that of model-based methods without learning a dynamics model (see the sketch at the end of this paragraph). We utilize REDQ in this work for improved sample efficiency during online fine-tuning. Specific to offline RL, REM [11] uses
random convex combinations of multiple Q-value estimates
to calculate the Q targets for effective offline RL on Atari
games. MOPO [23] uses probabilistic ensembles from PETS
to learn policies from offline data using uncertainty estimates
based on model disagreement. MBOP [43] uses ensembles
of dynamics models, Q functions, and policy networks to improve performance on locomotion tasks. Balanced Replay
[36] uses ensembles of pessimistic Q functions to mitigate
instability caused by distribution shift in offline-to-online RL.
While ensembling of Q functions has been studied in several prior works [9], [42], we combine it with a behavioral cloning loss for robust and sample-efficient offline-to-online RL.
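To illustrate the randomized ensembling used by REDQ, the sketch below computes a Q target from the minimum over a random subset of an ensemble of target critics; the ensemble interface, the subset size, and the handling of terminal transitions are assumptions for illustration and omit other ingredients of REDQ such as its high update-to-data ratio.

import random
import torch

def redq_q_target(target_critics, reward, next_state, next_action, done,
                  gamma=0.99, subset_size=2):
    # Randomly pick a small subset of the target-critic ensemble and take the
    # elementwise minimum of their predictions to curb overestimation.
    subset = random.sample(target_critics, subset_size)
    q_values = torch.stack([q(next_state, next_action) for q in subset], dim=0)
    q_min = q_values.min(dim=0).values
    # One-step Bellman target; every critic in the ensemble regresses toward it.
    return reward + gamma * (1.0 - done) * q_min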
Adaptive balancing of multiple objectives in RL. The authors of [44] train policies using learned dynamics models, in an active online learning setting, with the objective of visiting states that are most likely to lead to subsequent improvement of the dynamics model. They adaptively weigh the maximization of cumulative rewards and the minimization of model uncertainty using an online learning mechanism based on the exponential weights algorithm. In this paper, we focus on offline-to-online RL
using model-free methods and propose to adaptively weigh the
maximization of cumulative rewards and a behavioral cloning
loss. Exploring other online learning algorithms, such as the exponential weights algorithm, is left for future work.
III. BACKGROUND
A. Reinforcement Learning
Reinforcement learning (RL) deals with sequential decision
making to maximize cumulative rewards. RL problems are
often formalized as Markov decision processes (MDPs). An
MDP consists of a set of states $S$, a set of actions $A$, transition dynamics $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ representing the probability of transitioning to state $s_{t+1}$ by taking action $a_t$ in state $s_t$ at timestep $t$, a scalar reward function $r_t = R(s_t, a_t)$, and a discount factor $\gamma \in [0, 1]$.
A policy function $\pi$ of an RL agent is a mapping from states to actions and defines the behavior of the agent. The value function $V^\pi(s)$ of a policy $\pi$ is defined as the expected cumulative rewards from state $s$: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$, where the expectation is taken over state transitions $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and the policy $a_t \sim \pi(s_t)$. Similarly, the state-action value function $Q^\pi(s, a)$ is defined as the expected cumulative rewards after taking action $a$ in state $s$: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The goal of RL is to learn an optimal policy $\pi_\theta$, with parameters $\theta$, that maximizes the expected cumulative rewards:
$$\pi_\theta = \arg\max_\theta \; \mathbb{E}_{s \sim S}\left[ Q^{\pi_\theta}\big(s, \pi_\theta(s)\big) \right].$$
We use the TD3 algorithm [8] for reinforcement learning. TD3 is an actor-critic method that alternately trains: (i) the critic network $Q_\phi$ to estimate the values $Q^{\pi_\theta}(s, a)$ of the policy network $\pi_\theta$, and (ii) the policy network to produce actions that maximize the Q function, following the gradient $\nabla_\theta Q_\phi(s, \pi_\theta(s))$.
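As a minimal sketch of these alternating updates, assuming PyTorch-style networks and optimizers (clipped double-Q with twin critics, target-policy smoothing noise, delayed actor updates, and target-network updates from the full TD3 algorithm are omitted for brevity):

import torch
import torch.nn.functional as F

def td3_step(actor, actor_target, critic, critic_target, batch,
             actor_opt, critic_opt, gamma=0.99):
    state, action, reward, next_state, done = batch
    # (i) Critic update: regress Q_phi toward the one-step Bellman target.
    with torch.no_grad():
        next_action = actor_target(next_state)  # full TD3 also adds clipped noise here
        target = reward + gamma * (1.0 - done) * critic_target(next_state, next_action)
    critic_loss = F.mse_loss(critic(state, action), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # (ii) Actor update: ascend the gradient of Q_phi(s, pi_theta(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()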