an embedding directly from images [24], [25]. While model-
based learning has been shown to work well in some complex
dynamic environments [26], model-free methods remain a
popular choice in the dynamic locomotion community [16],
Others have turned to embedded and more descriptive action spaces [27], [28], [29], [30] and reduced-order models [23] to enable more robust and sample-efficient learning.
However, these efforts have mainly ignored the impact of
observation space compression on model-free learning.
Quantization: Quantization or discretization of deep neu-
ral network weights and parameters has seen increased
popularity [31], [32]. These methods achieve competitive performance on classic deep computer vision tasks while reducing the memory footprint of the convolutional neural network (CNN) models used. More recently, these same techniques have been adapted to reinforcement learning (RL) [33], [34], with work focusing mainly on quantizing the parameters of the critic or value networks to reduce overall model size and speed up learning. However, little has been done to evaluate the effect of applying quantization to a task’s observation space.
III. LEARNING BACKGROUND
Reinforcement learning poses problems as Markov decision processes (MDPs), where an MDP is defined by a set of observed and hidden states, $S$, actions, $A$, stochastic dynamics, $p(s_{t+1} \mid s_t, a_t)$, a reward function, $r(s, a)$, and a discount factor, $\gamma$. The RL objective is to compute the policy, $\pi^*(s, a)$, that maximizes the expected discounted sum of future rewards, $\mathbb{E}_{s,a}\!\left(\sum_t \gamma^t r_t\right)$. We train all learning tasks with two state-of-the-art on- and off-policy reinforcement learning algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).
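To make the objective concrete, the short Python sketch below computes the discounted return $\sum_t \gamma^t r_t$ of a single recorded reward sequence; the reward values and discount factor are illustrative only and are not taken from any experiment in this paper.

```python
# Minimal sketch: discounted return sum_t gamma^t * r_t for one trajectory.
# The reward sequence and gamma below are illustrative placeholders.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([1.0, 0.0, 0.5, 1.0]))  # 1.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0
```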
A. Proximal Policy Optimization (PPO)
Proximal Policy Optimization [35] is an on-policy, policy gradient [36] algorithm that employs an actor-critic framework to learn both the optimal policy, $\pi^*(s, a)$, as well as the optimal value function, $V^*(s)$. Both the policy, $\pi_\theta$, and the value function, $V_\phi$, are parameterized by neural networks with weights $\theta$ and $\phi$, respectively. Similar to Trust Region Policy Optimization (TRPO) [37], PPO stabilizes policy training by penalizing large policy updates. Additionally, as all updates are computed with samples taken from the current policy, PPO often requires high sample complexity.
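In the commonly used clipped variant of PPO [35], this penalty is implemented through the surrogate objective
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the updated and previous policies and $\hat{A}_t$ is an advantage estimate.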
During training, PPO needs to store the parameters of both its value and policy networks$^1$, usually shallow multilayer perceptrons (MLPs), as well as its on-policy rollout buffer $\mathcal{D}_k$. $\mathcal{D}_k$ stores $(s, a, r)$ tuples, which are refreshed during each iteration of the algorithm.
By combining both the models and rollout buffer, the general space complexity of PPO can be written as:
$$\mathrm{size} = \mathrm{sizeof}(\pi_\theta) + \mathrm{sizeof}(V_\phi) + \mathrm{sizeof}(\mathcal{D}_k). \tag{1}$$
$^1$Many PPO implementations save space by using one central MLP with two additional single-layer policy and value function model heads. We focus on the standard PPO model approach in this section for clarity.
Importantly, even though $\mathcal{D}_k$ is thrown out and an entirely new rollout buffer is re-collected during each iteration, $\mathcal{D}_k$ dominates the overall memory footprint of PPO.
For example, Miki et al. [2] use a student-teacher approach to bridge the sim-to-real gap and enable robust quadrupedal locomotion in rough, wilderness terrain. This process starts by using PPO to train the teacher model, a 3-layer MLP fed by two smaller encoder networks. Assuming all parameters are stored as 64-bit floats, the three models have a combined size of ∼1.3 MB. In comparison, the 391-dimensional observation space, 16-dimensional action space, and batch size of 8,300 lead to a rollout buffer of over 27 MB (assuming 64-bit floats). This means that $\mathcal{D}_k$ accounts for 95% of the total memory footprint. Furthermore, this size is dominated by the stored observations, which account for 96% of the size of $\mathcal{D}_k$ and thus 92% of the total memory footprint.
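These percentages follow from simple byte counting. The sketch below reproduces the arithmetic under the same 64-bit-float assumption; the ∼1.3 MB model size is taken from the text above rather than recomputed, so the figures are approximate.

```python
# Rough PPO memory-footprint estimate for the example above,
# assuming every stored value is a 64-bit (8-byte) float.
BYTES = 8
obs_dim, act_dim, rew_dim = 391, 16, 1
batch = 8_300                                  # rollout buffer transitions per iteration

buffer_bytes = batch * (obs_dim + act_dim + rew_dim) * BYTES   # ~27.1 MB
obs_bytes = batch * obs_dim * BYTES                            # ~26.0 MB
model_bytes = 1.3e6                                            # combined model size quoted above

total = buffer_bytes + model_bytes
print(f"buffer size:  {buffer_bytes / 1e6:.1f} MB")     # 27.1 MB
print(f"buffer/total: {buffer_bytes / total:.0%}")      # ~95%
print(f"obs/buffer:   {obs_bytes / buffer_bytes:.0%}")  # ~96%
print(f"obs/total:    {obs_bytes / total:.0%}")         # ~91-92%, depending on exact model size
```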
B. Soft Actor-Critic (SAC)
Soft Actor-Critic [14], [6] is an off-policy RL algorithm that generally extends soft Q-learning (SQL) [38] and optimizes a “maximum entropy” objective, which promotes exploration according to a temperature parameter, $\alpha$:
$$\mathbb{E}_{(s_t, a_t) \sim \pi}\!\left[\sum_t \gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right]. \tag{2}$$
SAC makes a number of improvements on SQL: it automatically tunes the temperature parameter $\alpha$; it uses double Q-learning, similar to the Twin Delayed DDPG (TD3) algorithm [39], to correct for overestimation in the Q-function; and it learns not only the Q-functions and the policy, but also the value function.
Like PPO, SAC’s memory usage comes from its models and off-policy replay buffer. Shallow MLPs are once again used to approximate the policy $\pi_\theta$ as well as the two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$. Unlike PPO, however, the replay buffer, $\mathcal{D}$, is generally much larger in size, as it stores trajectories from every iteration, usually acting as a size-limited queue.
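A minimal sketch of such a size-limited queue is shown below; the class name, default capacity, and tuple layout are illustrative choices and are not taken from any of the cited implementations.

```python
from collections import deque
import random

class ReplayBuffer:
    """Size-limited replay buffer: once full, the oldest transitions are evicted first."""

    def __init__(self, capacity=1_000_000):
        # deque with maxlen acts as a FIFO queue of fixed maximum size
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # uniform random minibatch for off-policy updates
        return random.sample(self.storage, batch_size)
```

Once `capacity` transitions have been stored, every additional `add` silently evicts the oldest entry, which is exactly the size-limiting behavior described above.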
By combining the models and replay buffer, the general space complexity of SAC can be written as:
$$\mathrm{size} = \mathrm{sizeof}(\pi_\theta) + 2 \cdot \mathrm{sizeof}(Q_\phi) + \mathrm{sizeof}(\mathcal{D}). \tag{3}$$
As in the case of PPO, $\mathcal{D}$ dominates the memory footprint of SAC. For example, Haarnoja et al. [14] used SAC to train the quadruped Minitaur to walk. The combined size of the parameters in their policy and two value networks (ignoring bias terms and again assuming 64-bit floats) was about 2.3 MB. With a 112-dimensional observation space, an 8-dimensional action space, and a replay buffer of size 1e6,$^2$ $\mathcal{D}$ consumed about 96.8 MB of memory, equating to 97.7% of the total memory footprint. Again, the size of $\mathcal{D}$ is dominated by the observations, which account for 92.5% of its size and 90.4% of the total memory footprint.
$^2$A conservative estimate, as they collect 100k–200k samples.
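As with the PPO example, these percentages come from straightforward byte counting. The sketch below reproduces them under the stated assumptions: 64-bit floats, the ∼2.3 MB model size quoted above, and the conservative estimate of 100k stored samples from footnote 2.

```python
# Rough SAC memory-footprint estimate for the Minitaur example,
# assuming 64-bit (8-byte) floats and ~100k stored transitions (footnote 2).
BYTES = 8
obs_dim, act_dim, rew_dim = 112, 8, 1
stored_samples = 100_000

buffer_bytes = stored_samples * (obs_dim + act_dim + rew_dim) * BYTES  # 96.8 MB
obs_bytes = stored_samples * obs_dim * BYTES                           # 89.6 MB
model_bytes = 2.3e6                                                    # model size quoted above

total = buffer_bytes + model_bytes
print(f"buffer size:  {buffer_bytes / 1e6:.1f} MB")     # 96.8 MB
print(f"buffer/total: {buffer_bytes / total:.1%}")      # ~97.7%
print(f"obs/buffer:   {obs_bytes / buffer_bytes:.1%}")  # ~92.6% (quoted as 92.5% above)
print(f"obs/total:    {obs_bytes / total:.1%}")         # ~90.4%
```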