CORL: Research-oriented Deep Offline Reinforcement
Learning Library
Denis Tarasov
Tinkoff
den.tarasov@tinkoff.ai
Alexander Nikulin
Tinkoff
a.p.nikulin@tinkoff.ai
Dmitry Akimov
Tinkoff
d.akimov@tinkoff.ai
Vladislav Kurenkov
Tinkoff
v.kurenkov@tinkoff.ai
Sergey Kolesnikov
Tinkoff
s.s.kolesnikov@tinkoff.ai
Abstract
CORL¹ is an open-source library that provides thoroughly benchmarked single-file
implementations of both deep offline and offline-to-online reinforcement learning
algorithms. It emphasizes a simple development experience with a straightforward
codebase and a modern analysis tracking tool. In CORL, we isolate method
implementations into separate single files, making performance-relevant details
easier to recognize. Additionally, an experiment tracking feature is available to
help log metrics, hyperparameters, dependencies, and more to the cloud. Finally,
we have ensured the reliability of the implementations by benchmarking on commonly
employed D4RL datasets, providing a transparent source of results that can be
reused with robust evaluation tools such as performance profiles, probability of
improvement, or expected online performance.
1 Introduction
Deep Offline Reinforcement Learning (Levine et al., 2020) has shown significant advancements
in numerous domains such as robotics (Smith et al., 2022; Kumar et al., 2021), autonomous driving
(Diehl et al., 2021), and recommender systems (Chen et al., 2022). Due to this rapid development,
many open-source offline RL solutions² have emerged to help RL practitioners understand and improve
well-known offline RL techniques in different fields. On the one hand, they introduce standard
interfaces and user-friendly APIs for offline RL algorithms, simplifying the incorporation of offline
RL methods into existing projects. On the other hand, the introduced abstractions may complicate the
learning curve for newcomers and the ease of adoption for researchers interested in developing new
algorithms. One needs to understand the modular design (spanning several files on average), which
(1) can comprise thousands of lines of code or (2) may hardly fit a novel method³.
In this technical report, we take a different perspective on an offline RL library and also incorporate
emerging interest in the offline-to-online setup. We propose CORL (Clean Offline Reinforcement
Learning) – minimalistic and isolated single-file implementations of deep offline and offline-to-online
RL algorithms, supported by open-sourced D4RL (Fu et al., 2020) benchmark results. The
uncomplicated design allows practitioners to read and understand the implementations of the
algorithms straightforwardly. Moreover, CORL supports optional integration with experiment
tracking tools such as Weights&Biases (Biewald, 2020), providing practitioners with a convenient way to analyze
the results and behavior of all algorithms, not merely relying on the final performance commonly
reported in papers.

¹ CORL Repository: https://github.com/corl-team/CORL
² https://github.com/hanjuku-kaso/awesome-offline-rl#oss
³ https://github.com/takuseno/d3rlpy/issues/141

37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
We hope that the CORL library will help offline RL newcomers study the implemented algorithms and
aid researchers in quickly modifying existing methods without fighting through different levels
of abstraction. Finally, the obtained results may serve as a reference point for D4RL benchmarks,
avoiding the need to re-implement and tune existing algorithms' hyperparameters.
Ya Cgai Fie Sige- Fie eeai Eeie Tackig Lg
eie aa
agih aa
AWAC / BC / CQL / DT
EDAC / QL / SAC- N / TD3+BC
h d. -- cg=cfg/d- he.a -- gdi=g/d- he -- - ech=50
Wadb g
Wadb e
Figure 1: An illustration of the CORL library design. A single-file implementation takes a YAML
configuration file with both environment and algorithm parameters to run an experiment, which logs
all required statistics to Weights&Biases (Biewald, 2020).
2 Related Work
Since the Atari breakthrough (Mnih et al., 2015), numerous open-source RL frameworks and libraries
have been developed over the last years (Dhariwal et al., 2017; Hill et al., 2018; Castro et al., 2018;
Gauci et al., 2018; Keng & Graesser, 2017; garage contributors, 2019; Duan et al., 2016; Kolesnikov
& Hrinchuk, 2019; Fujita et al., 2021; Liang et al., 2018; Liu et al., 2021; Huang et al., 2021;
Weng et al., 2021; Stooke & Abbeel, 2019), each focusing on a different perspective of RL. For
example, stable-baselines (Hill et al., 2018) provides many deep RL implementations that carefully
reproduce published results, backing up RL practitioners with reliable baselines for method
comparison. On the other hand, Ray (Liang et al., 2018) focuses on implementation scalability and
production-friendly usage. Finally, more nuanced solutions exist, such as Dopamine (Castro et al.,
2018), which emphasizes different DQN variants, or ReAgent (Gauci et al., 2018), which applies RL
to the RecSys domain.
At the same time, the offline RL branch, and especially the offline-to-online setting of interest
in this paper, is not yet covered as thoroughly: the only library that focuses precisely on the
offline RL setting is d3rlpy (Takuma Seno, 2021). While CORL also covers offline RL methods (Nair
et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021; Fujimoto & Gu, 2021; An et al., 2021;
Chen et al., 2021), similar to d3rlpy, it takes a different perspective on library design and provides
non-modular, independent algorithm implementations. More precisely, CORL does not introduce
additional abstractions to make offline RL more general but instead gives an "easy-to-hack" starter
kit for research needs. Finally, CORL also provides recent offline-to-online solutions (Nair et al.,
2020; Kumar et al., 2020; Kostrikov et al., 2021; Wu et al., 2022; Nakamoto et al., 2023; Tarasov
et al., 2023) that are gaining interest among researchers and practitioners.
Although CORL is not the first non-modular RL library (that distinction more likely belongs to CleanRL
(Huang et al., 2021)), it has two significant differences from its predecessor. First, CORL is
focused on offline and offline-to-online RL, while CleanRL implements online RL algorithms. Second,
CORL aims to minimize the complexity of its requirements and external dependencies. To be more
concrete, CORL does not pull in abstraction-heavy dependencies such as stable-baselines
(Hill et al., 2018) or envpool (Weng et al., 2022) but instead implements everything from scratch in
the codebase.
3 CORL Design
Single-File Implementations
Implementation subtleties significantly impact agent performance in deep RL (Henderson et al.,
2018; Engstrom et al., 2020; Fujimoto & Gu, 2021). Unfortunately, user-friendly abstractions and
general interfaces, the core idea behind modular libraries, encapsulate and often hide these important
nuances from practitioners. For this reason, CORL unwraps these details by adopting single-file
implementations. To be more concrete, we put environment details, algorithm hyperparameters, and
evaluation parameters into a single file⁴. For example, we provide any_percent_bc.py (404 LOC⁵) as a
baseline algorithm for comparison against offline RL methods, td3_bc.py (511 LOC) as a competitive
minimalistic offline RL algorithm (Fujimoto & Gu, 2021), and dt.py (540 LOC) as an example of the
recently proposed trajectory optimization approach (Chen et al., 2021).
Figure 1 depicts the overall library design. To avoid over-complicated offline implementations, we treat
offline and offline-to-online versions of the same algorithms separately. While such a design produces
code duplication across implementations, it has several essential benefits from both the educational and
research perspectives:
• Smooth learning curve. Having the entire code in one place makes it more straightforward to
understand all of its aspects. In other words, one may find it easier to dive into the 540 LOC of
a single-file Decision Transformer (Chen et al., 2021) implementation than into the 10+ files of
the original implementation⁶.

• Simple prototyping. As we are not interested in the code's general applicability, we can
make it implementation-specific. Such a design also removes the need for inheritance from
general primitives or their refactoring, reducing abstraction overhead to zero. At the same
time, this gives us complete freedom during code modification.

• Faster debugging. Without additional abstractions, an implementation simplifies to a single
for-loop with a global Python name scope, as sketched below. Furthermore, such a flat architecture
makes it easier to access and inspect any created variable during training, which is crucial in the
presence of modifications and debugging.
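To make the flat layout concrete, here is a self-contained toy example written in that style (it is illustrative only and not code from the library): configuration constants, a synthetic "offline" dataset, a linear behavioral-cloning policy, and the training loop all live in one global scope.

```python
# Toy illustration of the flat, single-file style: everything in one scope.
# This is NOT CORL code; the dataset and policy are synthetic placeholders.
import numpy as np

# --- config (in CORL this would come from a YAML file, see below) ---
SEED = 0
NUM_STEPS = 2_000
BATCH_SIZE = 64
LEARNING_RATE = 1e-2

rng = np.random.default_rng(SEED)

# --- toy "offline dataset": states and expert actions for behavioral cloning ---
states = rng.normal(size=(10_000, 4))
expert_weights = rng.normal(size=(4, 2))
actions = states @ expert_weights  # the "expert" is linear in this toy setup

# --- model: a linear policy trained with minibatch gradient descent on MSE ---
policy_weights = np.zeros((4, 2))

for step in range(NUM_STEPS):
    idx = rng.integers(0, len(states), size=BATCH_SIZE)
    batch_s, batch_a = states[idx], actions[idx]
    pred = batch_s @ policy_weights
    grad = batch_s.T @ (pred - batch_a) / BATCH_SIZE  # gradient of the MSE loss
    policy_weights -= LEARNING_RATE * grad

    # every intermediate variable (pred, grad, ...) is directly inspectable here
    if step % 500 == 0:
        loss = float(np.mean((pred - batch_a) ** 2))
        print(f"step={step} bc_loss={loss:.4f}")
```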
Configuration files
Although it is a typical pattern in the research community to use a command line interface (CLI) for
single-file experiments, CORL slightly improves on it with predefined configuration files. Utilizing
YAML parsing through the CLI, we gather all environment and algorithm hyperparameters for each
experiment into such files so that one can use them as an initial setup. We found that such a setup
(1) simplifies experiments, eliminating the need to keep all algorithm- and environment-specific
parameters in mind, and (2) remains convenient thanks to the familiar CLI approach.
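As a sketch of this pattern (not CORL's actual parsing code; the file name, field names, and flags below are hypothetical), a YAML configuration can be loaded into a typed dataclass and selectively overridden from the command line:

```python
# Generic "YAML config + CLI override" sketch using argparse and PyYAML.
# A hypothetical configs/halfcheetah_medium.yaml might contain:
#   env: halfcheetah-medium-v2
#   batch_size: 256
#   max_timesteps: 1000000
import argparse
import dataclasses

import yaml  # pip install pyyaml


@dataclasses.dataclass
class TrainConfig:
    env: str = "halfcheetah-medium-v2"
    batch_size: int = 256
    max_timesteps: int = 1_000_000
    project: str = "offline-rl"  # experiment-tracking project name


def parse_config() -> TrainConfig:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, help="path to a YAML config file")
    # expose every dataclass field as an optional CLI override
    for field in dataclasses.fields(TrainConfig):
        parser.add_argument(f"--{field.name}", type=field.type, default=None)
    args = parser.parse_args()

    values = {}
    if args.config is not None:
        with open(args.config) as f:
            values.update(yaml.safe_load(f))
    # CLI flags take precedence over values from the YAML file
    for field in dataclasses.fields(TrainConfig):
        override = getattr(args, field.name)
        if override is not None:
            values[field.name] = override
    return TrainConfig(**values)


if __name__ == "__main__":
    print(parse_config())
```

A run then looks like `python some_algo.py --config configs/halfcheetah_medium.yaml --batch_size 512` (again, hypothetical paths and flags), which mirrors the workflow shown in Figure 1.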
⁴ We follow the PEP8 style guide with a maximum line length of 89, which increases LOC a bit.
⁵ Lines Of Code.
⁶ Original Decision Transformer implementation: https://github.com/kzl/decision-transformer
Experiment Tracking
Offline RL evaluation is another challenging aspect of the current state of offline RL (Kurenkov &
Kolesnikov, 2022). To face this uncertainty, CORL supports integration with Weights&Biases
(Biewald, 2020), a modern experiment tracking tool. With each experiment, CORL automatically
saves (1) the source code, (2) dependencies (requirements.txt), (3) the hardware setup, (4) OS environment
variables, (5) hyperparameters, (6) training and system metrics, and (7) logs (stdout, stderr). See
Appendix B for an example.
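For illustration, a minimal tracking setup of this kind might look as follows; the project and metric names are placeholders rather than CORL's actual settings, and only standard wandb calls (wandb.init, wandb.log, wandb.finish) are used.

```python
# Minimal Weights&Biases logging sketch (placeholder names, not CORL's setup).
import wandb

config = {"env": "halfcheetah-medium-v2", "batch_size": 256, "gamma": 0.99}

wandb.init(
    project="offline-rl-experiments",  # hypothetical project name
    group="td3_bc",                    # group runs of the same algorithm together
    config=config,                     # hyperparameters are stored with the run
    save_code=True,                    # snapshot the single-file script itself
)

for step in range(1_000):
    # in a real run these would be actual training and evaluation metrics
    wandb.log({"critic_loss": 0.0, "normalized_score": 0.0}, step=step)

wandb.finish()
```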
Although Weights&Biases is a proprietary solution, alternatives such as TensorBoard (Abadi
et al., 2015) or Aim (Arakelyan et al., 2020) could be used with a few lines of code changed. It is
also important to note that with Weights&Biases tracking, one can easily use CORL with sweeps
or public reports.
We found full metrics tracking during the training process necessary for two reasons. First, it removes
the possible bias of the final or best performance commonly reported in papers. For example, one
could evaluate offline RL performance as the maximum achieved score, while another uses the average
score over the N (last) evaluations (Takuma Seno, 2021). Second, it provides an opportunity for advanced
performance analysis such as EOP (Kurenkov & Kolesnikov, 2022) or RLiable (Agarwal et al., 2021).
In short, when provided with all metric logs, one can utilize any performance statistic, not merely
relying on commonly used alternatives.
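Given per-seed score logs, such statistics are straightforward to compute. The rliable library (Agarwal et al., 2021) provides performance profiles and probability-of-improvement estimates with stratified bootstrap confidence intervals; the plain-NumPy sketch below, which uses synthetic scores purely as placeholders, only illustrates the underlying point estimates.

```python
# Point estimates for a performance profile and probability of improvement,
# computed from per-(task, seed) normalized scores. Scores here are synthetic.
import numpy as np

# scores[algo] has shape (num_tasks, num_seeds) of D4RL-normalized scores
scores = {
    "algo_a": np.random.default_rng(0).uniform(0, 110, size=(10, 4)),
    "algo_b": np.random.default_rng(1).uniform(0, 110, size=(10, 4)),
}


def performance_profile(algo_scores: np.ndarray, taus: np.ndarray) -> np.ndarray:
    # fraction of all (task, seed) runs whose score exceeds each threshold tau
    return (algo_scores.reshape(-1, 1) > taus.reshape(1, -1)).mean(axis=0)


def probability_of_improvement(x: np.ndarray, y: np.ndarray) -> float:
    # average over tasks of P(X > Y) across seed pairs; ties count as 1/2
    greater = (x[:, :, None] > y[:, None, :]).mean(axis=(1, 2))
    ties = (x[:, :, None] == y[:, None, :]).mean(axis=(1, 2))
    return float((greater + 0.5 * ties).mean())


taus = np.linspace(0.0, 110.0, 50)
print(performance_profile(scores["algo_a"], taus)[:5])
print(probability_of_improvement(scores["algo_a"], scores["algo_b"]))
```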
4 Benchmarking D4RL
4.1 Offline
In our library, we implemented the following offline algorithms: N%⁷ Behavioral Cloning (BC),
TD3+BC (Fujimoto & Gu, 2021), CQL (Kumar et al., 2020), IQL (Kostrikov et al., 2021), AWAC (Nair
et al., 2020), ReBRAC (Tarasov et al., 2023), SAC-N, EDAC (An et al., 2021), and Decision
Transformer (DT) (Chen et al., 2021). We evaluated every algorithm on the D4RL benchmark (Fu
et al., 2020), focusing on the Gym-MuJoCo, Maze2d, AntMaze, and Adroit tasks. Each algorithm was
run for one million gradient steps⁸ and evaluated using ten episodes for the Gym-MuJoCo and Adroit
tasks. For Maze2d, we use 100 evaluation episodes. In our experiments, we tried to rely on the
hyperparameters proposed in the original works as much as possible (see Appendix D for details).
The final performance is reported in Table 1 and the maximal performance in Table 2. The scores
are normalized to the range between 0 and 100 (Fu et al., 2020). Following the recent work by
Takuma Seno (2021), we report both the last and the best obtained scores to illustrate each algorithm's
potential performance and overfitting properties. Figure 2 shows the performance profiles and the
probability of improvement of ReBRAC over other algorithms (Agarwal et al., 2021). See Appendix A
for complete training performance graphs.
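For reference, the normalization follows the D4RL convention (Fu et al., 2020), which rescales raw episodic returns relative to per-environment random and expert reference returns. A minimal sketch is given below; the reference values used in the example are placeholders, not actual D4RL constants.

```python
# D4RL score normalization: 0 corresponds to a random policy, 100 to an expert.
def d4rl_normalized_score(raw_return: float,
                          random_return: float,
                          expert_return: float) -> float:
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# A raw return halfway between the reference returns maps to a score of 50.
print(d4rl_normalized_score(raw_return=550.0, random_return=100.0, expert_return=1000.0))
```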
Based on these results, we make several valuable observations. First, ReBRAC, IQL, and AWAC are,
on average, the most competitive baselines in the offline setup. Note that AWAC is often omitted in
recent works.
Observation 1: ReBRAC, IQL and AWAC are the strongest offline baselines on average.
Second, EDAC outperforms all other algorithms on Gym-MuJoCo by a significant margin, and to the
best of our knowledge, there are still no algorithms that perform much better on these tasks. SAC-N
shows the best performance on the Maze2d tasks. However, SAC-N and EDAC cannot solve the
AntMaze tasks and perform poorly in the Adroit domain.
Observation 2: SAC-N and EDAC are the strongest baselines for Gym-MuJoCo and Maze2d,
but they perform poorly on both AntMaze and Adroit domains.
⁷ N is the percentage of the best trajectories (with the highest return) used for training. We omit the percentage when it is equal to 100.
⁸ Except for SAC-N, EDAC, and DT, due to their original hyperparameters. See Appendix D for details.
Figure 2: (a) Performance profiles after offline training. (b) Probability of improvement of ReBRAC
over other algorithms after offline training. The curves (Agarwal et al., 2021) are for the D4RL
benchmark spanning the Gym-MuJoCo, Maze2d, AntMaze, and Adroit datasets.
Table 1: Normalized performance of the last trained policy on D4RL averaged over 4 random seeds.
Task Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 42.40 ±0.19 42.46 ±0.70 48.10 ±0.18 50.02 ±0.27 47.04 ±0.22 48.31 ±0.22 64.04 ±0.68 68.20 ±1.28 67.70 ±1.04 42.20 ±0.26
halfcheetah-medium-replay-v2 35.66 ±2.33 23.59 ±6.95 44.84 ±0.59 45.13 ±0.88 45.04 ±0.27 44.46 ±0.22 51.18 ±0.31 60.70 ±1.01 62.06 ±1.10 38.91 ±0.50
halfcheetah-medium-expert-v2 55.95 ±7.35 90.10 ±2.45 90.78 ±6.04 95.00 ±0.61 95.63 ±0.42 94.74 ±0.52 103.80 ±2.95 98.96 ±9.31 104.76 ±0.64 91.55 ±0.95
hopper-medium-v2 53.51 ±1.76 55.48 ±7.30 60.37 ±3.49 63.02 ±4.56 59.08 ±3.77 67.53 ±3.78 102.29 ±0.17 40.82 ±9.91 101.70 ±0.28 65.10 ±1.61
hopper-medium-replay-v2 29.81 ±2.07 70.42 ±8.66 64.42 ±21.52 98.88 ±2.07 95.11 ±5.27 97.43 ±6.39 94.98 ±6.53 100.33 ±0.78 99.66 ±0.81 81.77 ±6.87
hopper-medium-expert-v2 52.30 ±4.01 111.16 ±1.03 101.17 ±9.07 101.90 ±6.22 99.26 ±10.91 107.42 ±7.80 109.45 ±2.34 101.31 ±11.63 105.19 ±10.08 110.44 ±0.33
walker2d-medium-v2 63.23 ±16.24 67.34 ±5.17 82.71 ±4.78 68.52 ±27.19 80.75 ±3.28 80.91 ±3.17 85.82 ±0.77 87.47 ±0.66 93.36 ±1.38 67.63 ±2.54
walker2d-medium-replay-v2 21.80 ±10.15 54.35 ±6.34 85.62 ±4.01 80.62 ±3.58 73.09 ±13.22 82.15 ±3.03 84.25 ±2.25 78.99 ±0.50 87.10 ±2.78 59.86 ±2.73
walker2d-medium-expert-v2 98.96 ±15.98 108.70 ±0.25 110.03 ±0.36 111.44 ±1.62 109.56 ±0.39 111.72 ±0.86 111.86 ±0.43 114.93 ±0.41 114.75 ±0.74 107.11 ±0.96
Gym-MuJoCo avg 50.40 69.29 76.45 79.39 78.28 81.63 89.74 83.52 92.92 73.84
maze2d-umaze-v1 0.36 ±8.69 12.18 ±4.29 29.41 ±12.31 65.65 ±5.34 -8.90 ±6.11 42.11 ±0.58 106.87 ±22.16 130.59 ±16.52 95.26 ±6.39 18.08 ±25.42
maze2d-medium-v1 0.79 ±3.25 14.25 ±2.33 59.45 ±36.25 84.63 ±35.54 86.11 ±9.68 34.85 ±2.72 105.11 ±31.67 88.61 ±18.72 57.04 ±3.45 31.71 ±26.33
maze2d-large-v1 2.26 ±4.39 11.32 ±5.10 97.10 ±25.41 215.50 ±3.11 23.75 ±36.70 61.72 ±3.50 78.33 ±61.77 204.76 ±1.19 95.60 ±22.92 35.66 ±28.20
Maze2d avg 1.13 12.58 61.99 121.92 33.65 46.23 96.77 141.32 82.64 28.48
antmaze-umaze-v2 55.25 ±4.15 65.75 ±5.26 70.75 ±39.18 56.75 ±9.09 92.75 ±1.92 77.00 ±5.52 97.75 ±1.48 0.00 ±0.00 0.00 ±0.00 57.00 ±9.82
antmaze-umaze-diverse-v2 47.25 ±4.09 44.00 ±1.00 44.75 ±11.61 54.75 ±8.01 37.25 ±3.70 54.25 ±5.54 83.50 ±7.02 0.00 ±0.00 0.00 ±0.00 51.75 ±0.43
antmaze-medium-play-v2 0.00 ±0.00 2.00 ±0.71 0.25 ±0.43 0.00 ±0.00 65.75 ±11.61 65.75 ±11.71 89.50 ±3.35 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
antmaze-medium-diverse-v2 0.75 ±0.83 5.75 ±9.39 0.25 ±0.43 0.00 ±0.00 67.25 ±3.56 73.75 ±5.45 83.50 ±8.20 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
antmaze-large-play-v2 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00 20.75 ±7.26 42.00 ±4.53 52.25 ±29.01 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
antmaze-large-diverse-v2 0.00 ±0.00 0.75 ±0.83 0.00 ±0.00 0.00 ±0.00 20.50 ±13.24 30.25 ±3.63 64.00 ±5.43 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
AntMaze avg 17.21 19.71 19.33 18.58 50.71 57.17 78.42 0.00 0.00 18.12
pen-human-v1 71.03 ±6.26 26.99 ±9.60 -3.88 ±0.21 76.65 ±11.71 13.71 ±16.98 78.49 ±8.21 103.16 ±8.49 6.86 ±5.93 5.07 ±6.16 67.68 ±5.48
pen-cloned-v1 51.92 ±15.15 46.67 ±14.25 5.13 ±5.28 85.72 ±16.92 1.04 ±6.62 83.42 ±8.19 102.79 ±7.84 31.35 ±2.14 12.02 ±1.75 64.43 ±1.43
pen-expert-v1 109.65 ±7.28 114.96 ±2.96 122.53 ±21.27 159.91 ±1.87 -1.41 ±2.34 128.05 ±9.21 152.16 ±6.33 87.11 ±48.95 -1.55 ±0.81 116.38 ±1.27
door-human-v1 2.34 ±4.00 -0.13 ±0.07 -0.33 ±0.01 2.39 ±2.26 5.53 ±1.31 3.26 ±1.83 -0.10 ±0.01 -0.38 ±0.00 -0.12 ±0.13 4.44 ±0.87
door-cloned-v1 -0.09 ±0.03 0.29 ±0.59 -0.34 ±0.01 -0.01 ±0.01 -0.33 ±0.01 3.07 ±1.75 0.06 ±0.05 -0.33 ±0.00 2.66 ±2.31 7.64 ±3.26
door-expert-v1 105.35 ±0.09 104.04 ±1.46 -0.33 ±0.01 104.57 ±0.31 -0.32 ±0.02 106.65 ±0.25 106.37 ±0.29 -0.33 ±0.00 106.29 ±1.73 104.87 ±0.39
hammer-human-v1 3.03 ±3.39 -0.19 ±0.02 1.02 ±0.24 1.01 ±0.51 0.14 ±0.11 1.79 ±0.80 0.24 ±0.24 0.24 ±0.00 0.28 ±0.18 1.28 ±0.15
hammer-cloned-v1 0.55 ±0.16 0.12 ±0.08 0.25 ±0.01 1.27 ±2.11 0.30 ±0.01 1.50 ±0.69 5.00 ±3.75 0.14 ±0.09 0.19 ±0.07 1.82 ±0.55
hammer-expert-v1 126.78 ±0.64 121.75 ±7.67 3.11 ±0.03 127.08 ±0.13 0.26 ±0.01 128.68 ±0.33 133.62 ±0.27 25.13 ±43.25 28.52 ±49.00 117.45 ±6.65
relocate-human-v1 0.04 ±0.03 -0.14 ±0.08 -0.29 ±0.01 0.45 ±0.53 0.06 ±0.03 0.12 ±0.04 0.16 ±0.30 -0.31 ±0.01 -0.17 ±0.17 0.05 ±0.01
relocate-cloned-v1 -0.06 ±0.01 -0.00 ±0.02 -0.30 ±0.01 -0.01 ±0.03 -0.29 ±0.01 0.04 ±0.01 1.66 ±2.59 -0.01 ±0.10 0.17 ±0.35 0.16 ±0.09
relocate-expert-v1 107.58 ±1.20 97.90 ±5.21 -1.73 ±0.96 109.52 ±0.47 -0.30 ±0.02 106.11 ±4.02 107.52 ±2.28 -0.36 ±0.00 71.94 ±18.37 104.28 ±0.42
Adroit avg 48.18 42.69 10.40 55.71 1.53 53.43 59.39 12.43 18.78 49.21
Total avg 37.95 43.06 37.16 62.01 37.61 61.92 76.04 44.16 43.65 48.31
Third, during our experiments, we observed that the hyperparameters proposed for CQL in Kumar
et al. (2020) do not perform as well as claimed on most tasks. CQL is extremely sensitive to the
choice of hyperparameters, and we had to tune them extensively to make it work in each domain (see
Table 7). For example, AntMaze requires five hidden layers for the critic networks, while other tasks'
performance suffers with this number of layers. This sensitivity issue⁹ was already mentioned in
prior works as well (An et al., 2021; Ghasemipour et al., 2022).
Observation 3: CQL is extremely sensitive to the choice of hyperparameters and implementation
details.
Fourth, we also observe that hyperparameters do not always transfer the same way between deep
learning frameworks¹⁰. Our implementations of IQL and CQL use PyTorch, but the parameters from
the reference JAX implementations sometimes strongly underperform (e.g., IQL on the Hopper tasks
and CQL on Adroit).
⁹ See also https://github.com/aviralkumar2907/CQL/issues/9, https://github.com/tinkoff-ai/CORL/issues/14, and https://github.com/young-geng/CQL/issues/5
¹⁰ https://github.com/tinkoff-ai/CORL/issues/33