and problems being implemented in Python. Consequently, RL experiments may incur unnecessary computational cost and run time, which results in a higher carbon emission footprint [16]. Our concern grows further when comparing the number of RL environment implementations in Python with those in lower-level programming languages.
Our contribution addresses this gap by developing an alternative to AI Gym that avoids these adverse side effects, offering a comparable interface with increased computational efficiency. As a result, we hope to reduce the carbon emissions of RL experiments and contribute to a more sustainable AI.
B. Contribution Scope
We propose the CaiRL Environment toolkit to fill the need for a flexible, high-performance toolkit for running reinforcement learning experiments. CaiRL is a C++ interface designed to
improve setup, development, and execution times. Our toolkit
moves a considerable amount of computation to compile
time, which substantially reduces load times and the run-time
computation requirements for environments implemented in
the toolkit. CaiRL aims to have a near-identical interface to
AI Gym, ensuring that migrating existing codebases requires
minimal effort. As part of the CaiRL toolkit, we present, to
the best of our knowledge, the first Adobe Flash-compatible RL interface, with support for ActionScript 2 and 3.
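To illustrate the Gym-like reset()/step() interaction pattern described above, the following C++ sketch shows a minimal environment and interaction loop. The types, class names, and member functions are our own illustrative assumptions and do not represent CaiRL's actual API.

// Hypothetical sketch of a Gym-style environment loop in C++.
// All names below are illustrative assumptions, not the actual CaiRL API.
#include <cstdio>
#include <random>
#include <vector>

struct StepResult {
    std::vector<float> observation;  // next state
    float reward;                    // scalar reward
    bool done;                       // terminal flag
};

// A toy environment exposing the reset()/step() contract familiar from AI Gym.
class ToyEnvironment {
public:
    std::vector<float> reset() {
        steps_ = 0;
        return {0.0f};
    }
    StepResult step(int action) {
        ++steps_;
        const float reward = (action == 1) ? 1.0f : 0.0f;
        return {{static_cast<float>(steps_)}, reward, steps_ >= 200};
    }
    int actionCount() const { return 2; }

private:
    int steps_ = 0;
};

int main() {
    std::mt19937 rng(42);
    ToyEnvironment env;
    std::uniform_int_distribution<int> pickAction(0, env.actionCount() - 1);

    std::vector<float> observation = env.reset();
    float episodeReturn = 0.0f;
    bool done = false;
    while (!done) {  // Gym-like interaction loop
        const StepResult result = env.step(pickAction(rng));
        observation = result.observation;
        episodeReturn += result.reward;
        done = result.done;
    }
    std::printf("episode return: %.1f\n", episodeReturn);
    return 0;
}

An existing Gym training loop maps onto this pattern almost line by line, which is what keeps the porting effort low.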
Additionally, CaiRL supports environments running on the Java Virtual Machine (JVM), enabling the toolkit to run Java environments directly when porting code to C++ is impractical. Finally, CaiRL supports the widely used AI Gym toolkit, enabling existing Python environments to run seamlessly. Our contributions are summarized as follows:
1) Implement a more climate-sustainable and efficient ex-
periment execution toolkit for RL research.
2) Contribute novel problems for reinforcement learning
research as part of the CaiRL ecosystem.
3) Empirically demonstrate the performance effectiveness
of CaiRL.
4) Show that our solution effectively reduces the carbon emission footprint when measured using the metrics in [17].
5) Evaluate the training speed of CaiRL and AI Gym and empirically verify that improving environment execution times can substantially reduce the wall-clock time required to train RL agents.
C. Paper Organization
In Section II, we review the existing literature on reinforcement learning game design and compare existing solutions to identify the gap our research question addresses. Section III details reinforcement learning from the perspective of CaiRL and the problem we aim to solve. Section IV describes the design choices of CaiRL and provides a thorough justification for them. Section V presents our empirical findings on performance, discusses adoption challenges and how they are solved, and finally compares the interface of the CaiRL framework to OpenAI Gym (AI Gym). Section VI presents brief design recommendations for developers of new environments aimed at reinforcement learning research. Finally, we conclude our work and outline a path forward for adopting CaiRL.
II. BACKGROUND
A. Reinforcement Learning
Reinforcement learning is modeled as a Markov Decision Process (MDP), formally described by the tuple $(S, A, T, R, \gamma, s_0)$, where $S$ is the state space, $A$ is the action space, $T: S \times A \rightarrow S$ is the transition function, $R: S \times A \rightarrow \mathbb{R}$ is the reward function [18], $\gamma$ is the discount factor, and $s_0$ is the starting state. In the context of RL, the agent interacts with the environment iteratively until it reaches a terminal state, at which point the episode terminates. Q-Learning is an off-policy RL algorithm that seeks to find the best action to take in the current state. The algorithm operates on a Q-table, an n-dimensional matrix shaped by the state dimensions, whose final dimension holds the Q-values. A Q-value quantifies how good it is to take action $a$ in state $s_t$ at time $t$. This work uses deep learning function approximators in place of Q-tables to allow training in high-dimensional domains [19]. The resulting algorithm, Deep Q-Network (DQN), is one of the first deep learning-based approaches to RL and is commonly known for solving Atari 2600 games with superhuman performance [19]. Section V-C demonstrates that our toolkit significantly reduces the run-time and carbon emission footprint when training DQN in traditional control environments.
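For completeness, the tabular Q-learning update that DQN approximates with a neural network can be written as the standard update rule (a textbook formulation, not specific to CaiRL):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],$$

where $\alpha$ is the learning rate and $r_t = R(s_t, a_t)$ is the reward received at time $t$.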
B. Graphics Acceleration
A graphics accelerator, or graphics processing unit (GPU), executes machine code to produce images stored in a frame buffer. The machine code instructions are generated by a rendering unit that runs on either the central processing unit (CPU) or the GPU; these approaches are called software rendering and hardware rendering, respectively. GPUs are specialized electronics for computing graphics, with vastly superior parallelization capabilities compared to the CPU. Therefore, hardware rendering is typically preferred for computationally heavy rendering workloads, and it is reasonable to infer that hardware-accelerated graphics provide the best performance due to their capacity to generate frames quickly. However, we note that when the rendering process is relatively simple (e.g., 2D graphics) and access to the frame buffer is required, the expense of moving the frame buffer from GPU memory to CPU memory dramatically outweighs the benefits [20].
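To make this trade-off concrete, the sketch below (our own illustration, not part of CaiRL) times the device-to-host transfer of a single RGB frame using the CUDA runtime API; the frame resolution and iteration count are illustrative assumptions, and an RL pipeline that consumes rendered frames pays this copy on every environment step.

// Minimal sketch: measure the GPU-to-CPU copy cost of one rendered frame.
// Assumes a CUDA-capable GPU; compile with nvcc. Not part of CaiRL.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int width = 1920, height = 1080, channels = 3;  // illustrative frame size
    const size_t frameBytes = static_cast<size_t>(width) * height * channels;

    unsigned char* deviceFrame = nullptr;  // stands in for a GPU-rendered frame buffer
    unsigned char* hostFrame = nullptr;    // pinned CPU buffer handed to the agent
    cudaMalloc(reinterpret_cast<void**>(&deviceFrame), frameBytes);
    cudaMallocHost(reinterpret_cast<void**>(&hostFrame), frameBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iterations = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i) {
        // This transfer is incurred every time the agent observes a rendered frame.
        cudaMemcpy(hostFrame, deviceFrame, frameBytes, cudaMemcpyDeviceToHost);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);
    std::printf("average GPU->CPU copy per frame: %.3f ms\n", elapsedMs / iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(deviceFrame);
    cudaFreeHost(hostFrame);
    return 0;
}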
According to [20], software rendering on modern CPUs performs 2-10x faster due to specialized bytecode instructions. The study concludes that the GPU can render frames faster only if the frames reside permanently in GPU memory. Keeping frames exclusively in GPU memory is impractical for machine learning applications because each frame must be copied from the GPU to the CPU. The authors of [21] propose using Single Instruction
Multiple Data (SIMD) optimizations to improve game perfor-
mance. SIMD extends the CPU instruction set for vectorized
arithmetic to increase instruction throughput. The authors find