CORL: Research-oriented Deep Offline Reinforcement
Learning Library
Denis Tarasov
Tinkoff
den.tarasov@tinkoff.ai
Alexander Nikulin
Tinkoff
a.p.nikulin@tinkoff.ai
Dmitry Akimov
Tinkoff
d.akimov@tinkoff.ai
Vladislav Kurenkov
Tinkoff
v.kurenkov@tinkoff.ai
Sergey Kolesnikov
Tinkoff
s.s.kolesnikov@tinkoff.ai
Abstract
CORL¹ is an open-source library that provides thoroughly benchmarked single-file
implementations of both deep offline and offline-to-online reinforcement learning
algorithms. It emphasizes a simple development experience with a straightforward
codebase and a modern analysis tracking tool. In CORL, we isolate method
implementations into separate single files, making performance-relevant details
easier to recognize. Additionally, an experiment tracking feature is available to
help log metrics, hyperparameters, dependencies, and more to the cloud. Finally,
we have ensured the reliability of the implementations by benchmarking on commonly
employed D4RL datasets, providing a transparent source of results that can be
reused with robust evaluation tools such as performance profiles, probability of
improvement, or expected online performance.
1 Introduction
Deep Offline Reinforcement Learning (Levine et al., 2020) has shown significant advancements
in numerous domains such as robotics (Smith et al., 2022; Kumar et al., 2021), autonomous driving
(Diehl et al., 2021), and recommender systems (Chen et al., 2022). Due to this rapid development,
many open-source offline RL solutions² have emerged to help RL practitioners understand and improve
well-known offline RL techniques in different fields. On the one hand, they introduce standard
interfaces and user-friendly APIs for offline RL algorithms, simplifying the incorporation of offline
RL methods into existing projects. On the other hand, the introduced abstractions may complicate the
learning curve for newcomers and the ease of adoption for researchers interested in developing new
algorithms. One needs to understand the modular design (spanning several files on average), which
(1) can comprise thousands of lines of code or (2) may hardly fit a novel method³.
In this technical report, we take a different perspective on an offline RL library and also incorporate
emerging interest in the offline-to-online setup. We propose CORL (Clean Offline Reinforcement
Learning) – minimalistic and isolated single-file implementations of deep offline and offline-to-online
RL algorithms, supported by open-sourced D4RL (Fu et al., 2020) benchmark results. The
uncomplicated design allows practitioners to read and understand the implementations of the
algorithms straightforwardly. Moreover, CORL supports optional integration with experiment
tracking tools such as Weights&Biases (Biewald, 2020), providing practitioners with a convenient way to analyze
the results and behavior of all algorithms, not merely relying on the final performance commonly
reported in papers.

¹ CORL Repository: https://github.com/corl-team/CORL
² https://github.com/hanjuku-kaso/awesome-offline-rl#oss
³ https://github.com/takuseno/d3rlpy/issues/141

37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
We hope that the CORL library will help offline RL newcomers study the implemented algorithms and
aid researchers in quickly modifying existing methods without fighting through different levels
of abstraction. Finally, the obtained results may serve as a reference point for D4RL benchmarks,
avoiding the need to re-implement and tune existing algorithms' hyperparameters.
Ya Cgai Fie Sige- Fie eeai Eeie Tackig Lg
eie aa
agih aa
AWAC / BC / CQL / DT
EDAC / QL / SAC- N / TD3+BC
h d. -- cg=cfg/d- he.a -- gdi=g/d- he -- - ech=50
Wadb g
Wadb e
Figure 1: An illustration of the CORL library design. A single-file implementation takes a YAML
configuration file with both environment and algorithm parameters to run an experiment, which logs
all required statistics to Weights&Biases (Biewald, 2020).
2 Related Work
Since the Atari breakthrough (Mnih et al., 2015), numerous open-source RL frameworks and libraries
have been developed over the last years (Dhariwal et al., 2017; Hill et al., 2018; Castro et al., 2018;
Gauci et al., 2018; Keng & Graesser, 2017; garage contributors, 2019; Duan et al., 2016; Kolesnikov
& Hrinchuk, 2019; Fujita et al., 2021; Liang et al., 2018; Liu et al., 2021; Huang et al., 2021;
Weng et al., 2021; Stooke & Abbeel, 2019), each focusing on a different perspective of RL. For
example, stable-baselines (Hill et al., 2018) provides many deep RL implementations that carefully
reproduce published results, backing up RL practitioners with reliable baselines for method
comparison. On the other hand, Ray (Liang et al., 2018) focuses on implementation scalability and
production-friendly usage. Finally, more nuanced solutions exist, such as Dopamine (Castro et al.,
2018), which emphasizes different DQN variants, or ReAgent (Gauci et al., 2018), which applies RL
to the RecSys domain.
At the same time, the offline RL branch, and especially the offline-to-online setting of interest
in this paper, is not yet covered as thoroughly: the only library that focuses precisely on the
offline RL setting is d3rlpy (Takuma Seno, 2021). While CORL also covers offline RL methods (Nair
et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021; Fujimoto & Gu, 2021; An et al., 2021;
Chen et al., 2021), similar to d3rlpy, it takes a different perspective on library design and provides
non-modular, independent algorithm implementations. More precisely, CORL does not introduce
additional abstractions to make offline RL more general but instead gives an "easy-to-hack" starter
kit for research needs. Finally, CORL also provides recent offline-to-online solutions (Nair et al.,
2020; Kumar et al., 2020; Kostrikov et al., 2021; Wu et al., 2022; Nakamoto et al., 2023; Tarasov
et al., 2023) that are gaining interest among researchers and practitioners.
Although CORL is not the first non-modular RL library (that distinction more likely belongs to CleanRL
(Huang et al., 2021)), it has two significant differences from its predecessor. First, CORL is
focused on offline and offline-to-online RL, while CleanRL implements online RL algorithms. Second,
CORL aims to minimize the complexity of its requirements and external dependencies. To be more
concrete, CORL does not pull in abstraction-heavy dependencies such as stable-baselines
(Hill et al., 2018) or envpool (Weng et al., 2022) but instead implements everything from scratch in
the codebase.
3 CORL Design
Single-File Implementations
Implementation subtleties significantly impact agent performance in deep RL (Henderson et al.,
2018; Engstrom et al., 2020; Fujimoto & Gu, 2021). Unfortunately, user-friendly abstractions and
general interfaces, the core idea behind modular libraries, encapsulate and often hide these important
nuances from practitioners. For this reason, CORL unwraps these details by adopting single-file
implementations. To be more concrete, we put environment details, algorithm hyperparameters, and
evaluation parameters into a single file⁴. For example, we provide any_percent_bc.py (404 LOC⁵) as a
baseline algorithm for comparison against offline RL methods, td3_bc.py (511 LOC) as a competitive
minimalistic offline RL algorithm (Fujimoto & Gu, 2021), and dt.py (540 LOC) as an example of the
recently proposed trajectory optimization approach (Chen et al., 2021).
Figure 1 depicts the overall library design. To avoid over-complicated offline implementations, we treat
offline and offline-to-online versions of the same algorithms separately. While such a design produces
code duplication across implementations, it has several essential benefits from both the educational and
research perspectives:
• Smooth learning curve. Having the entire code in one place makes it more straightforward to
understand all of its aspects. In other words, one may find it easier to dive into the 540 LOC of
a single-file Decision Transformer (Chen et al., 2021) implementation than into the 10+ files of
the original implementation⁶.

• Simple prototyping. As we are not interested in the code's general applicability, we can
make it implementation-specific. Such a design also removes the need for inheritance from
general primitives or their refactoring, reducing abstraction overhead to zero. At the same
time, this gives us complete freedom during code modification.

• Faster debugging. Without additional abstractions, an implementation simplifies to a single
for-loop with a global Python name scope, as sketched below. Furthermore, such a flat architecture
makes it easier to access and inspect any created variable during training, which is crucial in the
presence of modifications and debugging.
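To make the flat layout concrete, here is a self-contained toy example written in that style (it is illustrative only and not code from the library): configuration constants, a synthetic "offline" dataset, a linear behavioral-cloning policy, and the training loop all live in one global scope.

```python
# Toy illustration of the flat, single-file style: everything in one scope.
# This is NOT CORL code; the dataset and policy are synthetic placeholders.
import numpy as np

# --- config (in CORL this would come from a YAML file, see below) ---
SEED = 0
NUM_STEPS = 2_000
BATCH_SIZE = 64
LEARNING_RATE = 1e-2

rng = np.random.default_rng(SEED)

# --- toy "offline dataset": states and expert actions for behavioral cloning ---
states = rng.normal(size=(10_000, 4))
expert_weights = rng.normal(size=(4, 2))
actions = states @ expert_weights  # the "expert" is linear in this toy setup

# --- model: a linear policy trained with minibatch gradient descent on MSE ---
policy_weights = np.zeros((4, 2))

for step in range(NUM_STEPS):
    idx = rng.integers(0, len(states), size=BATCH_SIZE)
    batch_s, batch_a = states[idx], actions[idx]
    pred = batch_s @ policy_weights
    grad = batch_s.T @ (pred - batch_a) / BATCH_SIZE  # gradient of the MSE loss
    policy_weights -= LEARNING_RATE * grad

    # every intermediate variable (pred, grad, ...) is directly inspectable here
    if step % 500 == 0:
        loss = float(np.mean((pred - batch_a) ** 2))
        print(f"step={step} bc_loss={loss:.4f}")
```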
Configuration files
Although it is a typical pattern in the research community to use a command line interface (CLI) for
single-file experiments, CORL slightly improves on it with predefined configuration files. Utilizing
YAML parsing through the CLI, we gather all environment and algorithm hyperparameters for each
experiment into such files so that one can use them as an initial setup. We found that such a setup
(1) simplifies experiments, eliminating the need to keep all algorithm- and environment-specific
parameters in mind, and (2) remains convenient thanks to the familiar CLI approach.
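As a sketch of this pattern (not CORL's actual parsing code; the file name, field names, and flags below are hypothetical), a YAML configuration can be loaded into a typed dataclass and selectively overridden from the command line:

```python
# Generic "YAML config + CLI override" sketch using argparse and PyYAML.
# A hypothetical configs/halfcheetah_medium.yaml might contain:
#   env: halfcheetah-medium-v2
#   batch_size: 256
#   max_timesteps: 1000000
import argparse
import dataclasses

import yaml  # pip install pyyaml


@dataclasses.dataclass
class TrainConfig:
    env: str = "halfcheetah-medium-v2"
    batch_size: int = 256
    max_timesteps: int = 1_000_000
    project: str = "offline-rl"  # experiment-tracking project name


def parse_config() -> TrainConfig:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, help="path to a YAML config file")
    # expose every dataclass field as an optional CLI override
    for field in dataclasses.fields(TrainConfig):
        parser.add_argument(f"--{field.name}", type=field.type, default=None)
    args = parser.parse_args()

    values = {}
    if args.config is not None:
        with open(args.config) as f:
            values.update(yaml.safe_load(f))
    # CLI flags take precedence over values from the YAML file
    for field in dataclasses.fields(TrainConfig):
        override = getattr(args, field.name)
        if override is not None:
            values[field.name] = override
    return TrainConfig(**values)


if __name__ == "__main__":
    print(parse_config())
```

A run then looks like `python some_algo.py --config configs/halfcheetah_medium.yaml --batch_size 512` (again, hypothetical paths and flags), which mirrors the workflow shown in Figure 1.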
⁴ We follow the PEP8 style guide with a maximum line length of 89, which increases LOC a bit.
⁵ Lines Of Code.
⁶ Original Decision Transformer implementation: https://github.com/kzl/decision-transformer
Experiment Tracking
Offline RL evaluation is another challenging aspect of the current state of offline RL (Kurenkov &
Kolesnikov, 2022). To face this uncertainty, CORL supports integration with Weights&Biases
(Biewald, 2020), a modern experiment tracking tool. With each experiment, CORL automatically
saves (1) the source code, (2) dependencies (requirements.txt), (3) the hardware setup, (4) OS environment
variables, (5) hyperparameters, (6) training and system metrics, and (7) logs (stdout, stderr). See
Appendix B for an example.
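For illustration, a minimal tracking setup of this kind might look as follows; the project and metric names are placeholders rather than CORL's actual settings, and only standard wandb calls (wandb.init, wandb.log, wandb.finish) are used.

```python
# Minimal Weights&Biases logging sketch (placeholder names, not CORL's setup).
import wandb

config = {"env": "halfcheetah-medium-v2", "batch_size": 256, "gamma": 0.99}

wandb.init(
    project="offline-rl-experiments",  # hypothetical project name
    group="td3_bc",                    # group runs of the same algorithm together
    config=config,                     # hyperparameters are stored with the run
    save_code=True,                    # snapshot the single-file script itself
)

for step in range(1_000):
    # in a real run these would be actual training and evaluation metrics
    wandb.log({"critic_loss": 0.0, "normalized_score": 0.0}, step=step)

wandb.finish()
```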
Although Weights&Biases is a proprietary solution, alternatives such as TensorBoard (Abadi
et al., 2015) or Aim (Arakelyan et al., 2020) could be used with a few lines of code changed. It is
also important to note that with Weights&Biases tracking, one can easily use CORL with sweeps
or public reports.
We found full metrics tracking during the training process necessary for two reasons. First, it removes
the possible bias of the final or best performance commonly reported in papers. For example, one
could evaluate offline RL performance as the maximum achieved score, while another uses the average
score over the N (last) evaluations (Takuma Seno, 2021). Second, it provides an opportunity for advanced
performance analysis such as EOP (Kurenkov & Kolesnikov, 2022) or RLiable (Agarwal et al., 2021).
In short, when provided with all metric logs, one can utilize any performance statistic, not merely
relying on commonly used alternatives.
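Given per-seed score logs, such statistics are straightforward to compute. The rliable library (Agarwal et al., 2021) provides performance profiles and probability-of-improvement estimates with stratified bootstrap confidence intervals; the plain-NumPy sketch below, which uses synthetic scores purely as placeholders, only illustrates the underlying point estimates.

```python
# Point estimates for a performance profile and probability of improvement,
# computed from per-(task, seed) normalized scores. Scores here are synthetic.
import numpy as np

# scores[algo] has shape (num_tasks, num_seeds) of D4RL-normalized scores
scores = {
    "algo_a": np.random.default_rng(0).uniform(0, 110, size=(10, 4)),
    "algo_b": np.random.default_rng(1).uniform(0, 110, size=(10, 4)),
}


def performance_profile(algo_scores: np.ndarray, taus: np.ndarray) -> np.ndarray:
    # fraction of all (task, seed) runs whose score exceeds each threshold tau
    return (algo_scores.reshape(-1, 1) > taus.reshape(1, -1)).mean(axis=0)


def probability_of_improvement(x: np.ndarray, y: np.ndarray) -> float:
    # average over tasks of P(X > Y) across seed pairs; ties count as 1/2
    greater = (x[:, :, None] > y[:, None, :]).mean(axis=(1, 2))
    ties = (x[:, :, None] == y[:, None, :]).mean(axis=(1, 2))
    return float((greater + 0.5 * ties).mean())


taus = np.linspace(0.0, 110.0, 50)
print(performance_profile(scores["algo_a"], taus)[:5])
print(probability_of_improvement(scores["algo_a"], scores["algo_b"]))
```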
4 Benchmarking D4RL
4.1 Offline
In our library, we implemented the following offline algorithms: N%⁷ Behavioral Cloning (BC),
TD3+BC (Fujimoto & Gu, 2021), CQL (Kumar et al., 2020), IQL (Kostrikov et al., 2021), AWAC (Nair
et al., 2020), ReBRAC (Tarasov et al., 2023), SAC-N, EDAC (An et al., 2021), and Decision
Transformer (DT) (Chen et al., 2021). We evaluated every algorithm on the D4RL benchmark (Fu
et al., 2020), focusing on the Gym-MuJoCo, Maze2d, AntMaze, and Adroit tasks. Each algorithm was
run for one million gradient steps⁸ and evaluated using ten episodes for the Gym-MuJoCo and Adroit
tasks. For Maze2d, we use 100 evaluation episodes. In our experiments, we tried to rely on the
hyperparameters proposed in the original works as much as possible (see Appendix D for details).
The final performance is reported in Table 1 and the maximal performance in Table 2. The scores
are normalized to the range between 0 and 100 (Fu et al., 2020). Following the recent work by
Takuma Seno (2021), we report both the last and the best obtained scores to illustrate each algorithm's
potential performance and overfitting properties. Figure 2 shows the performance profiles and the
probability of improvement of ReBRAC over other algorithms (Agarwal et al., 2021). See Appendix A
for complete training performance graphs.
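For reference, the normalization follows the D4RL convention (Fu et al., 2020), which rescales raw episodic returns relative to per-environment random and expert reference returns. A minimal sketch is given below; the reference values used in the example are placeholders, not actual D4RL constants.

```python
# D4RL score normalization: 0 corresponds to a random policy, 100 to an expert.
def d4rl_normalized_score(raw_return: float,
                          random_return: float,
                          expert_return: float) -> float:
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# A raw return halfway between the reference returns maps to a score of 50.
print(d4rl_normalized_score(raw_return=550.0, random_return=100.0, expert_return=1000.0))
```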
Based on these results, we make several valuable observations. First, ReBRAC, IQL, and AWAC are,
on average, the most competitive baselines in the offline setup. Note that AWAC is often omitted in
recent works.
Observation 1: ReBRAC, IQL and AWAC are the strongest offline baselines on average.
Second, EDAC outperforms all other algorithms on Gym-MuJoCo by a significant margin, and to the
best of our knowledge, there are still no algorithms that perform much better on these tasks. SAC-N
shows the best performance on the Maze2d tasks. However, SAC-N and EDAC cannot solve the
AntMaze tasks and perform poorly in the Adroit domain.
Observation 2: SAC-N and EDAC are the strongest baselines for Gym-MuJoCo and Maze2d,
but they perform poorly on both AntMaze and Adroit domains.
⁷ N is the percentage of the best trajectories (with the highest return) used for training. We omit the percentage when it is equal to 100.
⁸ Except for SAC-N, EDAC, and DT, due to their original hyperparameters. See Appendix D for details.
Figure 2: (a) Performance profiles after offline training. (b) Probability of improvement of ReBRAC
over other algorithms after offline training. The curves (Agarwal et al., 2021) are for the D4RL
benchmark spanning the Gym-MuJoCo, Maze2d, AntMaze, and Adroit datasets.
Table 1: Normalized performance of the last trained policy on D4RL averaged over 4 random seeds.
Task Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 42.40 ±0.19 42.46 ±0.70 48.10 ±0.18 50.02 ±0.27 47.04 ±0.22 48.31 ±0.22 64.04 ±0.68 68.20 ±1.28 67.70 ±1.04 42.20 ±0.26
halfcheetah-medium-replay-v2 35.66 ±2.33 23.59 ±6.95 44.84 ±0.59 45.13 ±0.88 45.04 ±0.27 44.46 ±0.22 51.18 ±0.31 60.70 ±1.01 62.06 ±1.10 38.91 ±0.50
halfcheetah-medium-expert-v2 55.95 ±7.35 90.10 ±2.45 90.78 ±6.04 95.00 ±0.61 95.63 ±0.42 94.74 ±0.52 103.80 ±2.95 98.96 ±9.31 104.76 ±0.64 91.55 ±0.95
hopper-medium-v2 53.51 ±1.76 55.48 ±7.30 60.37 ±3.49 63.02 ±4.56 59.08 ±3.77 67.53 ±3.78 102.29 ±0.17 40.82 ±9.91 101.70 ±0.28 65.10 ±1.61
hopper-medium-replay-v2 29.81 ±2.07 70.42 ±8.66 64.42 ±21.52 98.88 ±2.07 95.11 ±5.27 97.43 ±6.39 94.98 ±6.53 100.33 ±0.78 99.66 ±0.81 81.77 ±6.87
hopper-medium-expert-v2 52.30 ±4.01 111.16 ±1.03 101.17 ±9.07 101.90 ±6.22 99.26 ±10.91 107.42 ±7.80 109.45 ±2.34 101.31 ±11.63 105.19 ±10.08 110.44 ±0.33
walker2d-medium-v2 63.23 ±16.24 67.34 ±5.17 82.71 ±4.78 68.52 ±27.19 80.75 ±3.28 80.91 ±3.17 85.82 ±0.77 87.47 ±0.66 93.36 ±1.38 67.63 ±2.54
walker2d-medium-replay-v2 21.80 ±10.15 54.35 ±6.34 85.62 ±4.01 80.62 ±3.58 73.09 ±13.22 82.15 ±3.03 84.25 ±2.25 78.99 ±0.50 87.10 ±2.78 59.86 ±2.73
walker2d-medium-expert-v2 98.96 ±15.98 108.70 ±0.25 110.03 ±0.36 111.44 ±1.62 109.56 ±0.39 111.72 ±0.86 111.86 ±0.43 114.93 ±0.41 114.75 ±0.74 107.11 ±0.96
Gym-MuJoCo avg 50.40 69.29 76.45 79.39 78.28 81.63 89.74 83.52 92.92 73.84
maze2d-umaze-v1 0.36 ±8.69 12.18 ±4.29 29.41 ±12.31 65.65 ±5.34 -8.90 ±6.11 42.11 ±0.58 106.87 ±22.16 130.59 ±16.52 95.26 ±6.39 18.08 ±25.42
maze2d-medium-v1 0.79 ±3.25 14.25 ±2.33 59.45 ±36.25 84.63 ±35.54 86.11 ±9.68 34.85 ±2.72 105.11 ±31.67 88.61 ±18.72 57.04 ±3.45 31.71 ±26.33
maze2d-large-v1 2.26 ±4.39 11.32 ±5.10 97.10 ±25.41 215.50 ±3.11 23.75 ±36.70 61.72 ±3.50 78.33 ±61.77 204.76 ±1.19 95.60 ±22.92 35.66 ±28.20
Maze2d avg 1.13 12.58 61.99 121.92 33.65 46.23 96.77 141.32 82.64 28.48
antmaze-umaze-v2 55.25 ±4.15 65.75 ±5.26 70.75 ±39.18 56.75 ±9.09 92.75 ±1.92 77.00 ±5.52 97.75 ±1.48 0.00 ±0.00 0.00 ±0.00 57.00 ±9.82
antmaze-umaze-diverse-v2 47.25 ±4.09 44.00 ±1.00 44.75 ±11.61 54.75 ±8.01 37.25 ±3.70 54.25 ±5.54 83.50 ±7.02 0.00 ±0.00 0.00 ±0.00 51.75 ±0.43
antmaze-medium-play-v2 0.00 ±0.00 2.00 ±0.71 0.25 ±0.43 0.00 ±0.00 65.75 ±11.61 65.75 ±11.71 89.50 ±3.35 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
antmaze-medium-diverse-v2 0.75 ±0.83 5.75 ±9.39 0.25 ±0.43 0.00 ±0.00 67.25 ±3.56 73.75 ±5.45 83.50 ±8.20 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
antmaze-large-play-v2 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00 20.75 ±7.26 42.00 ±4.53 52.25 ±29.01 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
antmaze-large-diverse-v2 0.00 ±0.00 0.75 ±0.83 0.00 ±0.00 0.00 ±0.00 20.50 ±13.24 30.25 ±3.63 64.00 ±5.43 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
AntMaze avg 17.21 19.71 19.33 18.58 50.71 57.17 78.42 0.00 0.00 18.12
pen-human-v1 71.03 ±6.26 26.99 ±9.60 -3.88 ±0.21 76.65 ±11.71 13.71 ±16.98 78.49 ±8.21 103.16 ±8.49 6.86 ±5.93 5.07 ±6.16 67.68 ±5.48
pen-cloned-v1 51.92 ±15.15 46.67 ±14.25 5.13 ±5.28 85.72 ±16.92 1.04 ±6.62 83.42 ±8.19 102.79 ±7.84 31.35 ±2.14 12.02 ±1.75 64.43 ±1.43
pen-expert-v1 109.65 ±7.28 114.96 ±2.96 122.53 ±21.27 159.91 ±1.87 -1.41 ±2.34 128.05 ±9.21 152.16 ±6.33 87.11 ±48.95 -1.55 ±0.81 116.38 ±1.27
door-human-v1 2.34 ±4.00 -0.13 ±0.07 -0.33 ±0.01 2.39 ±2.26 5.53 ±1.31 3.26 ±1.83 -0.10 ±0.01 -0.38 ±0.00 -0.12 ±0.13 4.44 ±0.87
door-cloned-v1 -0.09 ±0.03 0.29 ±0.59 -0.34 ±0.01 -0.01 ±0.01 -0.33 ±0.01 3.07 ±1.75 0.06 ±0.05 -0.33 ±0.00 2.66 ±2.31 7.64 ±3.26
door-expert-v1 105.35 ±0.09 104.04 ±1.46 -0.33 ±0.01 104.57 ±0.31 -0.32 ±0.02 106.65 ±0.25 106.37 ±0.29 -0.33 ±0.00 106.29 ±1.73 104.87 ±0.39
hammer-human-v1 3.03 ±3.39 -0.19 ±0.02 1.02 ±0.24 1.01 ±0.51 0.14 ±0.11 1.79 ±0.80 0.24 ±0.24 0.24 ±0.00 0.28 ±0.18 1.28 ±0.15
hammer-cloned-v1 0.55 ±0.16 0.12 ±0.08 0.25 ±0.01 1.27 ±2.11 0.30 ±0.01 1.50 ±0.69 5.00 ±3.75 0.14 ±0.09 0.19 ±0.07 1.82 ±0.55
hammer-expert-v1 126.78 ±0.64 121.75 ±7.67 3.11 ±0.03 127.08 ±0.13 0.26 ±0.01 128.68 ±0.33 133.62 ±0.27 25.13 ±43.25 28.52 ±49.00 117.45 ±6.65
relocate-human-v1 0.04 ±0.03 -0.14 ±0.08 -0.29 ±0.01 0.45 ±0.53 0.06 ±0.03 0.12 ±0.04 0.16 ±0.30 -0.31 ±0.01 -0.17 ±0.17 0.05 ±0.01
relocate-cloned-v1 -0.06 ±0.01 -0.00 ±0.02 -0.30 ±0.01 -0.01 ±0.03 -0.29 ±0.01 0.04 ±0.01 1.66 ±2.59 -0.01 ±0.10 0.17 ±0.35 0.16 ±0.09
relocate-expert-v1 107.58 ±1.20 97.90 ±5.21 -1.73 ±0.96 109.52 ±0.47 -0.30 ±0.02 106.11 ±4.02 107.52 ±2.28 -0.36 ±0.00 71.94 ±18.37 104.28 ±0.42
Adroit avg 48.18 42.69 10.40 55.71 1.53 53.43 59.39 12.43 18.78 49.21
Total avg 37.95 43.06 37.16 62.01 37.61 61.92 76.04 44.16 43.65 48.31
Third, during our experiments, we observed that the hyperparameters proposed for CQL in Kumar
et al. (2020) do not perform as well as claimed on most tasks. CQL is extremely sensitive to the
choice of hyperparameters, and we had to tune them extensively to make it work in each domain (see
Table 7). For example, AntMaze requires five hidden layers for the critic networks, while other tasks'
performance suffers with this number of layers. This sensitivity issue⁹ was already mentioned in
prior works as well (An et al., 2021; Ghasemipour et al., 2022).
Observation 3: CQL is extremely sensitive to the choice of hyperparameters and implementation
details.
Fourth, we also observe that hyperparameters do not always transfer the same way between deep
learning frameworks¹⁰. Our implementations of IQL and CQL use PyTorch, but the parameters from
the reference JAX implementations sometimes strongly underperform (e.g., IQL on the Hopper tasks
and CQL on Adroit).
⁹ See also https://github.com/aviralkumar2907/CQL/issues/9, https://github.com/tinkoff-ai/CORL/issues/14, and https://github.com/young-geng/CQL/issues/5
¹⁰ https://github.com/tinkoff-ai/CORL/issues/33