
Experiment Tracking
Evaluation is another challenging aspect of the current state of offline RL (Kurenkov &
Kolesnikov, 2022). To address this uncertainty, CORL supports integration with Weights&Biases
(Biewald, 2020), a modern experiment tracking tool. For each experiment, CORL automatically
saves (1) source code, (2) dependencies (requirements.txt), (3) hardware setup, (4) OS environment
variables, (5) hyperparameters, (6) training and system metrics, and (7) logs (stdout, stderr). See
Appendix B for an example.
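To make the tracking setup concrete, below is a minimal sketch of how a run could be registered with Weights&Biases. It is illustrative rather than CORL's exact code: the TrainConfig fields and the project and group names are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass

import wandb


@dataclass
class TrainConfig:
    # hypothetical hyperparameters, for illustration only
    env: str = "halfcheetah-medium-v2"
    learning_rate: float = 3e-4
    batch_size: int = 256


def wandb_init(config: TrainConfig) -> None:
    wandb.init(
        project="CORL",            # project/group/name are placeholders
        group="IQL-D4RL",
        name=f"IQL-{config.env}",
        config=asdict(config),     # (5) hyperparameters
        save_code=True,            # (1) source code snapshot
    )


# Inside the training loop, metrics are reported against the global step, e.g.
# wandb.log({"eval/normalized_score": score}, step=step); system metrics and
# stdout/stderr are captured by the tracker itself.
```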
Although Weights&Biases is a proprietary solution, alternatives such as Tensorboard (Abadi
et al., 2015) or Aim (Arakelyan et al., 2020) can be used instead with only a few lines of code changed. It is
also worth noting that, with Weights&Biases tracking, one can easily use CORL with sweeps
or public reports.
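As a rough illustration of how small such a change is, the logging calls could be redirected to TensorBoard as sketched below; the log directory, tag name, and scores are arbitrary examples, not CORL defaults.

```python
from torch.utils.tensorboard import SummaryWriter

# Replaces the wandb.init(...) call; the log directory is an arbitrary example.
writer = SummaryWriter(log_dir="runs/iql-halfcheetah-medium-v2")

# Replaces wandb.log({"eval/normalized_score": score}, step=step).
for step, score in enumerate([12.5, 30.1, 47.8]):  # dummy evaluation scores
    writer.add_scalar("eval/normalized_score", score, global_step=step)

writer.close()
```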
We found full metrics tracking during the training process necessary for two reasons. First, it removes
the possible bias of the final or best performance commonly reported in papers. For example, one
work could evaluate offline RL performance as the maximum achieved score, while another uses the
average score over the last N evaluations (Takuma Seno, 2021). Second, it provides an opportunity for advanced
performance analysis such as EOP (Kurenkov & Kolesnikov, 2022) or RLiable (Agarwal et al., 2021).
In short, when provided with all metrics logs, one can compute any performance statistic instead of
relying only on the commonly used alternatives.
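For instance, with the full evaluation history logged, both reporting protocols can be recomputed after the fact. The snippet below is a simple sketch over dummy normalized scores, not part of the library.

```python
import numpy as np

# Per-evaluation normalized scores recovered from the tracked logs (dummy values).
eval_scores = np.array([10.2, 35.7, 48.1, 51.3, 49.8, 50.6])

best_score = eval_scores.max()              # "best" protocol: maximum achieved score
last_n = 3                                  # "last" protocol: average over the last N evaluations
final_score = eval_scores[-last_n:].mean()

print(f"best: {best_score:.1f}, last-{last_n} average: {final_score:.1f}")
```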
4 Benchmarking D4RL
4.1 Offline
In our library, we implemented the following offline algorithms: N% Behavioral Cloning (BC)⁷, TD3+BC (Fujimoto & Gu, 2021), CQL (Kumar et al., 2020), IQL (Kostrikov et al., 2021), AWAC (Nair
et al., 2020), ReBRAC (Tarasov et al., 2023), SAC-N, EDAC (An et al., 2021), and Decision
Transformer (DT) (Chen et al., 2021). We evaluated every algorithm on the D4RL benchmark (Fu
et al., 2020), focusing on Gym-MuJoCo, Maze2D, AntMaze, and Adroit tasks. Each algorithm was
run for one million gradient steps⁸ and evaluated using ten episodes for Gym-MuJoCo and Adroit
tasks. For Maze2D, we used 100 evaluation episodes. In our experiments, we tried to rely on the
hyperparameters proposed in the original works as much as possible (see Appendix D for details).
The final performance is reported in Table 1 and the maximal performance in Table 2. The scores
are normalized to the range between 0 and 100 (Fu et al., 2020). Following the recent work by
Takuma Seno (2021), we report the last and best-obtained scores to illustrate each algorithm’s potential
performance and overfitting properties. Figure 2 shows the performance profiles and probability of
improvement of ReBRAC over other algorithms (Agarwal et al., 2021). See Appendix A for complete
training performance graphs.
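The normalization follows D4RL's convention, 100 * (return - random return) / (expert return - random return), which the benchmark exposes directly. A short sketch is given below; the environment name is a standard D4RL task and the episode return is a dummy value.

```python
import gym
import d4rl  # noqa: F401  (registers the D4RL environments)

env = gym.make("halfcheetah-medium-v2")

episode_return = 4500.0  # dummy undiscounted return from one evaluation episode
# get_normalized_score returns (return - random) / (expert - random); scale to 0-100.
normalized_score = 100.0 * env.get_normalized_score(episode_return)
print(f"normalized score: {normalized_score:.1f}")
```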
Based on these results, we make several valuable observations. First, ReBRAC, IQL, and AWAC are
the most competitive baselines in the offline setup on average. Note that AWAC is often omitted in recent
works.
Observation 1: ReBRAC, IQL and AWAC are the strongest offline baselines on average.
Second, EDAC outperforms all other algorithms on Gym-MuJoCo by a significant margin, and to the
best of our knowledge, there are still no algorithms that perform considerably better on these tasks. SAC-N shows
the best performance on Maze2D tasks. At the same time, however, SAC-N and EDAC cannot solve the
AntMaze tasks and perform poorly in the Adroit domain.
Observation 2: SAC-N and EDAC are the strongest baselines for Gym-MuJoCo and Maze2D,
but they perform poorly in both the AntMaze and Adroit domains.
Third, during our experiments, we observed that the hyperparameters proposed for CQL in Kumar
et al. (2020) do not perform as well as claimed on most tasks. CQL is extremely sensitive to the
⁷ N is the percentage of trajectories with the highest return used for training. We omit the percentage when
it equals 100.
⁸ Except for SAC-N, EDAC, and DT, which follow their original hyperparameters. See Appendix D for details.