an embedding directly from images [24], [25]. While model-
based learning has been shown to work well in some complex
dynamic environments [26], model-free methods remain a
popular choice in the dynamic locomotion community [16],
Others have turned to embedded and more descriptive action spaces [27], [28], [29], [30] and reduced-order models [23] to enable more robust and sample-efficient learning.
However, these efforts have mainly ignored the impact of
observation space compression on model-free learning.
Quantization: Quantization or discretization of deep neu-
ral network weights and parameters has seen increased
popularity [31], [32]. These methods achieve competitive performance on classic deep computer vision tasks while reducing the memory footprint of the convolutional neural network (CNN) models used. More recently, these same techniques have been adapted to reinforcement learning (RL) [33], [34], with work focusing mainly on quantizing the parameters of the critic or value networks to reduce overall model size and speed up learning. However, little has been done to evaluate the effect of applying quantization to a task’s observation space.
III. LEARNING BACKGROUND
Reinforcement learning poses problems as Markov decision processes (MDPs), where an MDP is defined by a set of observed and hidden states, $S$, actions, $A$, stochastic dynamics, $p(s_{t+1} \mid s_t, a_t)$, a reward function, $r(s, a)$, and a discount factor, $\gamma$. The RL objective is to compute the policy, $\pi^*(s, a)$, that maximizes the expected discounted sum of future rewards, $\mathbb{E}_{s,a}\!\left(\sum_t \gamma^t r_t\right)$. We train all learning tasks with two state-of-the-art on- and off-policy reinforcement learning algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).
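To make the objective concrete, the short Python sketch below computes the discounted return $\sum_t \gamma^t r_t$ of a single recorded reward sequence; the reward values and discount factor are illustrative only and are not taken from any experiment in this paper.

```python
# Minimal sketch: discounted return sum_t gamma^t * r_t for one trajectory.
# The reward sequence and gamma below are illustrative placeholders.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([1.0, 0.0, 0.5, 1.0]))  # 1.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0
```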
A. Proximal Policy Optimization (PPO)
Proximal Policy Optimization [35] is an on-policy, policy gradient [36] algorithm that employs an actor-critic framework to learn both the optimal policy, $\pi^*(s, a)$, as well as the optimal value function, $V^*(s)$. Both the policy, $\pi_\theta$, and the value function, $V_\phi$, are parameterized by neural networks with weights $\theta$ and $\phi$, respectively. Similar to Trust Region Policy Optimization (TRPO) [37], PPO stabilizes policy training by penalizing large policy updates. Additionally, as all updates are computed with samples taken from the current policy, PPO often requires high sample complexity.
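In the commonly used clipped variant of PPO [35], this penalty is implemented through the surrogate objective
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the updated and previous policies and $\hat{A}_t$ is an advantage estimate.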
During training, PPO needs to store the parameters of both its value and policy networks$^1$, usually shallow multilayer perceptrons (MLPs), as well as its on-policy rollout buffer $\mathcal{D}_k$. $\mathcal{D}_k$ stores $(s, a, r)$ tuples, which are refreshed during each iteration of the algorithm.
By combining both the models and rollout buffer, the general space complexity of PPO can be written as:
$$\mathrm{size} = \mathrm{sizeof}(\pi_\theta) + \mathrm{sizeof}(V_\phi) + \mathrm{sizeof}(\mathcal{D}_k). \tag{1}$$
$^1$Many PPO implementations save space by using one central MLP with two additional single-layer policy and value function model heads. We focus on the standard PPO model approach in this section for clarity.
Importantly, even though $\mathcal{D}_k$ is thrown out and an entirely new rollout buffer is re-collected during each iteration, $\mathcal{D}_k$ dominates the overall memory footprint of PPO.
For example, Miki et al. [2] use a student-teacher approach to bridge the sim-to-real gap and enable robust quadrupedal locomotion in rough, wilderness terrain. This process starts by using PPO to train the teacher model, a 3-layer MLP fed by two smaller encoder networks. Assuming all parameters are stored as 64-bit floats, the three models have a combined size of ∼1.3 MB. In comparison, the 391-dimensional observation space, 16-dimensional action space, and batch size of 8,300 lead to a rollout buffer of over 27 MB (assuming 64-bit floats). This means that $\mathcal{D}_k$ accounts for 95% of the total memory footprint. Furthermore, this size is dominated by the stored observations, which account for 96% of the size of $\mathcal{D}_k$ and thus 92% of the total memory footprint.
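These percentages follow from simple byte counting. The sketch below reproduces the arithmetic under the same 64-bit-float assumption; the ∼1.3 MB model size is taken from the text above rather than recomputed, so the figures are approximate.

```python
# Rough PPO memory-footprint estimate for the example above,
# assuming every stored value is a 64-bit (8-byte) float.
BYTES = 8
obs_dim, act_dim, rew_dim = 391, 16, 1
batch = 8_300                                  # rollout buffer transitions per iteration

buffer_bytes = batch * (obs_dim + act_dim + rew_dim) * BYTES   # ~27.1 MB
obs_bytes = batch * obs_dim * BYTES                            # ~26.0 MB
model_bytes = 1.3e6                                            # combined model size quoted above

total = buffer_bytes + model_bytes
print(f"buffer size:  {buffer_bytes / 1e6:.1f} MB")     # 27.1 MB
print(f"buffer/total: {buffer_bytes / total:.0%}")      # ~95%
print(f"obs/buffer:   {obs_bytes / buffer_bytes:.0%}")  # ~96%
print(f"obs/total:    {obs_bytes / total:.0%}")         # ~91-92%, depending on exact model size
```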
B. Soft Actor-Critic (SAC)
Soft Actor-Critic [14], [6] is an off-policy RL algorithm that generally extends soft Q-learning (SQL) [38] and optimizes a “maximum entropy” objective, which promotes exploration according to a temperature parameter, $\alpha$:
$$\mathbb{E}_{(s_t, a_t) \sim \pi}\!\left[\sum_t \gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right]. \tag{2}$$
SAC makes a number of improvements on SQL: it automatically tunes the temperature parameter $\alpha$; it uses double Q-learning, similar to the Twin Delayed DDPG (TD3) algorithm [39], to correct for overestimation in the Q-function; and it learns not only the Q-functions and the policy, but also the value function.
Like PPO, SAC’s memory usage comes from its models and off-policy replay buffer. Shallow MLPs are once again used to approximate the policy $\pi_\theta$ as well as the two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$. Unlike PPO, however, the replay buffer, $\mathcal{D}$, is generally much larger in size, as it stores trajectories from every iteration, usually acting as a size-limited queue.
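A minimal sketch of such a size-limited queue is shown below; the class name, default capacity, and tuple layout are illustrative choices and are not taken from any of the cited implementations.

```python
from collections import deque
import random

class ReplayBuffer:
    """Size-limited replay buffer: once full, the oldest transitions are evicted first."""

    def __init__(self, capacity=1_000_000):
        # deque with maxlen acts as a FIFO queue of fixed maximum size
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # uniform random minibatch for off-policy updates
        return random.sample(self.storage, batch_size)
```

Once `capacity` transitions have been stored, every additional `add` silently evicts the oldest entry, which is exactly the size-limiting behavior described above.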
By combining the models and replay buffer, the general space complexity of SAC can be written as:
$$\mathrm{size} = \mathrm{sizeof}(\pi_\theta) + 2 \cdot \mathrm{sizeof}(Q_\phi) + \mathrm{sizeof}(\mathcal{D}). \tag{3}$$
As in the case of PPO, $\mathcal{D}$ dominates the memory footprint of SAC. For example, Haarnoja et al. [14] used SAC to train the quadruped Minitaur to walk. The combined size of the parameters in their policy and two value networks (ignoring bias terms and again assuming 64-bit floats) was about 2.3 MB. With a 112-dimensional observation space, an 8-dimensional action space, and a replay buffer of size 1e6,$^2$ $\mathcal{D}$ consumed about 96.8 MB of memory, equating to 97.7% of the total memory footprint. Again, the size of $\mathcal{D}$ is dominated by the observations, which account for 92.5% of its size and 90.4% of the total memory footprint.
$^2$A conservative estimate, as they collect 100k–200k samples.
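As with the PPO example, these percentages come from straightforward byte counting. The sketch below reproduces them under the stated assumptions: 64-bit floats, the ∼2.3 MB model size quoted above, and the conservative estimate of 100k stored samples from footnote 2.

```python
# Rough SAC memory-footprint estimate for the Minitaur example,
# assuming 64-bit (8-byte) floats and ~100k stored transitions (footnote 2).
BYTES = 8
obs_dim, act_dim, rew_dim = 112, 8, 1
stored_samples = 100_000

buffer_bytes = stored_samples * (obs_dim + act_dim + rew_dim) * BYTES  # 96.8 MB
obs_bytes = stored_samples * obs_dim * BYTES                           # 89.6 MB
model_bytes = 2.3e6                                                    # model size quoted above

total = buffer_bytes + model_bytes
print(f"buffer size:  {buffer_bytes / 1e6:.1f} MB")     # 96.8 MB
print(f"buffer/total: {buffer_bytes / total:.1%}")      # ~97.7%
print(f"obs/buffer:   {obs_bytes / buffer_bytes:.1%}")  # ~92.6% (quoted as 92.5% above)
print(f"obs/total:    {obs_bytes / total:.1%}")         # ~90.4%
```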