OLLA: Optimizing the Lifetime and Location of Arrays
to Reduce the Memory Usage of Neural Networks
Benoit Steiner*
(FAIR)
Mostafa Elhoushi
(Meta)
Jacob Kahn
(FAIR)
James Hegarty
(Meta)
Abstract
The size of deep neural networks has grown exponentially
in recent years. Unfortunately, hardware devices have not kept
pace with the rapidly increasing memory requirements. To
cope with this, researchers have turned to techniques such as
spilling and recomputation, which increase training time, or
reduced precision and model pruning, which can affect model
accuracy.
We present OLLA, an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks. Our method reduces the memory usage of existing neural networks, without needing any modification to the models or their training procedures.
We formulate the problem as a joint integer linear program
(ILP). We present several techniques to simplify the encoding
of the problem, and enable our approach to scale to the size
of state-of-the-art neural networks using an off-the-shelf ILP
solver. We experimentally demonstrate that OLLA takes only minutes, if not seconds, to enable the training of neural networks using one-third less memory on average. OLLA is available at https://github.com/facebookresearch/OLLA.
1. Introduction
Scale is a major force behind the accuracy improvements of machine-learning-based solutions [9], and both the depth and width of deep neural networks (DNN) are expanding exponentially [66] (Figure 1). This inflation in size increases the memory needed to store the weights of the neural network and the intermediate results (e.g., activations and gradients) generated during the training process. Compounding the problem, researchers are training neural networks on larger inputs, such as high-resolution images [22, 71], video [28], three-dimensional point clouds [14], and long natural language sequences [75, 15, 21], and they are using larger batch sizes to increase efficiency [69].
Unfortunately, due to the slowing of Moore's law, the memory capacity of hardware has only increased linearly over the last decade (Figure 2). Thus, the amount of memory available on the hardware used to train DNNs has not kept pace with the needs of deep learning. Furthermore, features powered by machine learning, such as automatic speech recognition [63] or keyboard suggestions [35], are being personalized by fine-tuning models on-device. This means that model training is increasingly being pushed to even more memory-constrained edge devices such as smartphones.
*Correspondence to benoitsteiner@meta.com
Figure 1: The number of deep neural network parameters has increased 100,000-fold over the last 10 years, starting to grow exponentially around 2016. The y-axis is plotted on a log scale.
As a result, memory is increasingly becoming a bottleneck that hinders progress, and researchers frequently mention memory scarcity as a limiting factor that impacts their work [47, 36, 12, 18, 15].
The research community has proposed several solutions to mitigate the problem. Data spilling [54] and recomputation of intermediate results [43] relieve memory pressure. Novel neural network architectures [40] use memory more sparingly. Strategies such as reduced-precision training [76, 78, 45] and weight pruning [56, 24, 37] decrease memory requirements. However, these all come at the cost of decreasing the accuracy of the model, increasing the time it takes to train it, or both [49, 7, 57].
Popular deep learning frameworks such as PyTorch [61] and TensorFlow [1] do not fully utilize the limited memory available. Similar to traditional dynamic memory allocators such as tcmalloc [32] and jemalloc [26], these frameworks maintain a pool of free blocks of memory at runtime. To serve a memory request, they look for a large enough block in the memory pool, or allocate one from physical memory if none is available. This results in memory fragmentation when free memory blocks do not exactly match the size of an allocation request, which occurs frequently.
Figure 2: The memory capacity of NVIDIA datacenter GPUs (in gigabytes) has only increased tenfold over the last decade, which has not kept pace with the rapidly increasing size of deep neural networks. The y-axis is plotted on a linear scale.
Furthermore, DNN frameworks do not optimize tensor lifetimes. PyTorch [61] executes operations in the order in which they are defined in the program. TensorFlow [1] keeps a queue of operators that are ready to run, and executes them on a first-come, first-served basis. As a result, tensors can be allocated earlier than required, or freed later than necessary, wasting valuable memory.
Our method overcomes these two limitations of existing deep learning frameworks. We model the computations performed to train a deep neural network as a dataflow graph of operations. We analyze this graph to find a topological ordering of the nodes that adjusts the lifetimes of the tensors generated by these operations to minimize the peak amount of memory that needs to be allocated (Figure 3). Furthermore, we find an optimal packing of these tensors, which minimizes memory fragmentation (Figure 4). We encode these two objectives as an integer linear program (ILP) that can be solved quickly by commodity solvers, and present OLLA (Optimization of the Lifetime and Location of Arrays), our algorithm for memory-optimal training of neural networks.

In addition to significantly reducing memory usage, our solution has four key strengths. First, it does not impact the accuracy of the predictions of the neural networks. Second, it requires no modification to the neural network or the training procedure. Third, it does not increase training time. Fourth, it is orthogonal to, and can be combined with, other memory reduction techniques to further reduce the memory needs of a neural network.
Our work makes the following novel contributions:
• We formulate the problem of finding the lifetimes and memory locations of tensors that minimize the peak memory required to train neural networks as a joint integer linear program.
• We demonstrate how to leverage domain knowledge to simplify the ILP formulation, which enables off-the-shelf solvers to quickly reduce the memory usage of large DNNs.
• We study empirically the practicality and effectiveness of our solution on a wide variety of DNNs, achieving average memory savings exceeding 30% in a median time of less than 10 seconds.
• We provide an open source implementation of OLLA, which is available at https://github.com/facebookresearch/OLLA.

[Figure 3 diagram omitted: a four-node dataflow graph with tensors e1:10MB, e2:10MB, e3:20MB, e4:30MB, e5:5MB, e6:10MB. Order #1 (v1, v2, v3, v4) has resident sets {e1,e2,e3}=40MB, {e2,e3,e5}=35MB, {e2,e4,e5}=45MB, {e4,e5,e6}=45MB, for a peak of 45MB. Order #2 (v1, v3, v2, v4) has resident sets {e1,e2,e3}=40MB, {e2,e3,e4}=60MB, {e3,e4,e5}=55MB, {e4,e5,e6}=45MB, for a peak of 60MB.]
Figure 3: Node execution orders can impact peak memory usage. Edges are annotated with the size of their corresponding tensors, and the two feasible node orders are annotated with the set of edges resident in memory at each step. Running v2 before v3 is significantly more memory efficient.
2. Background
2.1. Representing Neural Networks as Dataflow Graphs
Deep neural networks can be represented using dataflow graphs, as pioneered by TensorFlow [1]. The nodes of the graph encode the computations to be performed (e.g., matrix multiplications, convolutions, activation functions), while the edges represent the data (aka tensor or array) that is produced by an operation and transferred to consumer nodes.
Due to the producer-consumer relation between connected nodes, edges are oriented. Each edge has exactly one source, which is the operator that generated the corresponding tensor. Since a tensor can be consumed by more than one node, edges can have multiple sinks.
Operators can have multiple incoming (aka fanin) edges. Typically, one of these incoming edges will be the tensor generated by the previous layer, and another one will be a weight tensor. Similarly, operators can have multiple outgoing (aka fanout) edges: while most operations generate a single output tensor, some may create two or more. Operators with no fanout edges are used to model the final outputs of the neural network. Operators without fanin edges can model random number generators, constants, weights, or initial inputs to the neural network.
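To make this representation concrete, here is a minimal Python sketch of such a dataflow graph with single-source, multi-sink edges. It is our own illustration rather than code from OLLA; the Node, Edge, and connect names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Edge:
    """A tensor produced by one operator and consumed by zero or more others."""
    name: str
    size: int                                           # tensor size in bytes
    source: Optional["Node"] = None                     # exactly one producer
    sinks: List["Node"] = field(default_factory=list)   # possibly several consumers

@dataclass
class Node:
    """An operator (e.g., a matrix multiplication, convolution, or activation)."""
    name: str
    fanin: List[Edge] = field(default_factory=list)     # tensors it consumes
    fanout: List[Edge] = field(default_factory=list)    # tensors it produces

def connect(producer: Node, consumers: List[Node], name: str, size: int) -> Edge:
    """Create an edge carrying a `size`-byte tensor from `producer` to `consumers`."""
    e = Edge(name=name, size=size, source=producer, sinks=list(consumers))
    producer.fanout.append(e)
    for c in consumers:
        c.fanin.append(e)
    return e

# A node with two fanin edges: an activation from the previous layer and a weight.
prev_layer, weights, matmul = Node("prev_layer"), Node("weights"), Node("matmul")
connect(prev_layer, [matmul], "activation", 4 * 1024 * 1024)
connect(weights, [matmul], "weight", 2 * 1024 * 1024)
```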
In the remainder of this paper, we assume that the graphs are acyclic. In practice, this is not a significant limitation since recurrent neural networks such as LSTM [39] have been eclipsed by transformers [75]. Furthermore, their loops can be unrolled to avoid the problem altogether.
2.2. Optimizing Tensor Lifetimes
In order for an operator to run, all its input tensors must be resident in memory, and its output tensors must have been allocated so that they can be written to while the node executes. Additionally, to avoid recomputing tensors, once a tensor is generated it must be preserved in memory until all its consumers have been run.
We define the resident set RS(t) at a given time t as the set of tensors that need to be kept in memory at that point in the execution of the neural network. It comprises the tensors in the fanin and fanout of the operator that is scheduled for execution at timestep t, as well as all the other tensors that were previously generated but need to be kept in memory to be able to run subsequent operators. The peak resident set is the largest resident set over the execution of the network.
The order in which nodes are executed impacts the lifetimes of the tensors, and therefore the peak resident set. Figure 3 illustrates a simple example where changing the operator ordering noticeably improves memory usage.
Among all possible node orderings, those prioritizing the execution of nodes that free large amounts of data while generating little output data themselves are likely to be more efficient. However, as demonstrated in prior works [5, 8], finding an optimal schedule for a generic DAG is an NP-complete problem, which cannot be solved with a simple greedy approach.
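To make the effect illustrated in Figure 3 concrete, the sketch below (our own simplification, with the graph's connectivity reconstructed from the resident sets listed in the figure) computes the peak resident set of a given execution order; it reports 45MB for order #1 and 60MB for order #2.

```python
def peak_memory(order, edges):
    """Peak resident-set size for a given execution order.

    `edges` maps an edge name to (source node, sink nodes, size). A tensor is
    resident from the step at which its source runs (step 0 for external inputs)
    until the last step at which one of its sinks runs (the final step for
    network outputs).
    """
    step = {v: i for i, v in enumerate(order)}
    last = len(order) - 1

    def lifetime(source, sinks):
        start = step.get(source, 0)
        end = max((step[s] for s in sinks if s in step), default=last)
        return start, end

    peak = 0
    for i in range(len(order)):
        resident = sum(
            size for source, sinks, size in edges.values()
            if lifetime(source, sinks)[0] <= i <= lifetime(source, sinks)[1]
        )
        peak = max(peak, resident)
    return peak

# Figure 3's graph: edge -> (producer, consumers, size in MB).
edges = {
    "e1": ("input", ["v1"], 10),
    "e2": ("v1", ["v3"], 10),
    "e3": ("v1", ["v2"], 20),
    "e4": ("v3", ["v4"], 30),
    "e5": ("v2", ["v4"], 5),
    "e6": ("v4", ["output"], 10),
}
print(peak_memory(["v1", "v2", "v3", "v4"], edges))  # order #1 -> 45
print(peak_memory(["v1", "v3", "v2", "v4"], edges))  # order #2 -> 60
```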
2.3. Optimizing Tensor Locations in Memory
Similar to malloc-style memory allocators, the tensor allocation schemes used by typical deep learning frameworks operate online and suffer from fragmentation. Indeed, free memory is often segregated into small blocks interspersed with memory allocated to live tensors. As a result, a significant fraction of the total memory is effectively unusable because it is divided into pieces that are not large enough to fit a tensor. Figure 4 illustrates this phenomenon, and demonstrates that planning the location of each tensor ahead of time can significantly reduce the overall peak memory usage.
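The toy first-fit allocator below (our own sketch with made-up sizes, not the allocator of PyTorch or TensorFlow) reproduces the failure mode of Figure 4: after a tensor is freed, enough memory is free in total, yet an allocation fails because no contiguous gap is large enough, whereas a planned placement would have left room.

```python
def first_fit_offset(live, size, capacity):
    """Return the lowest offset at which a block of `size` bytes fits without
    overlapping any live allocation, or None if no gap is large enough.
    `live` is a list of (offset, size) pairs for currently allocated tensors."""
    candidate = 0
    for offset, allocated in sorted(live):
        if candidate + size <= offset:
            break                         # the gap before this block is big enough
        candidate = max(candidate, offset + allocated)
    return candidate if candidate + size <= capacity else None

capacity = 100
# Online, greedy placement: A (30 B) at offset 0, B (30 B) packed right after it
# at offset 30. A is later freed, leaving 70 B free split across two gaps.
live = [(30, 30)]                              # only B remains resident
print(first_fit_offset(live, 60, capacity))    # None: no contiguous 60 B gap
# Planned ahead of time: B is placed at offset 70 instead, so freeing A leaves
# one contiguous 70 B region and the 60 B tensor fits.
live = [(70, 30)]
print(first_fit_offset(live, 60, capacity))    # 0
```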
3. Formulation
We propose to take advantage of the predictability of neural network computations to proactively Optimize the Lifetime and Location of Arrays (OLLA).
[Figure 4 diagram omitted: two placements of tensors A, B, C, and D in the address space across the steps Input, Op 1, Op 2, Op 3, and Out; in the greedy layout (top) tensor C fails to fit, while in the planned layout (bottom) all tensors fit.]
Figure 4: Memory fragmentation can cause allocation to fail.
A greedy allocator (top) would not leave any room between
tensor A and B, thus making the space unusable for tensor C
once A is freed. OLLA (bottom) leaves a gap between tensor
A and B to enable the reuse of the memory freed by tensor A
and fits all the tensors in memory.
We formulate the problem of optimizing the ordering of computations (which determines the tensor lifetimes) and the location of tensors in memory (which determines the amount of memory fragmentation) for generic dataflow graphs, including those used in neural network training. We encode the problem as an integer linear program, and use an off-the-shelf ILP solver to find a solution that minimizes the peak memory required to run the dataflow graph.
We solve the ILP problem ahead of time, before the training process starts. This results in a small one-time initial cost, which can be recouped over the duration of the training: as free space does not need to be found at runtime, our memory allocation and deallocation operations are much cheaper than those of a standard allocator, thus saving some time at each training step (section 5.7).
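To give a flavor of what such an ahead-of-time encoding looks like, here is a toy memory-packing problem written with the off-the-shelf PuLP solver. It is a deliberately simplified sketch, not the joint formulation presented in this paper: the tensor lifetimes are taken as given instead of being optimized, and the names and sizes are made up. It only shows how non-overlap constraints and a peak-memory objective can be expressed as a (mixed-)integer linear program.

```python
# A toy memory-packing program using PuLP (pip install pulp).
import pulp

# tensor name -> (first step alive, last step alive, size in bytes); illustrative only.
tensors = {"A": (0, 1, 30), "B": (0, 3, 30), "C": (2, 3, 60)}
BIG_M = sum(size for _, _, size in tensors.values())

prob = pulp.LpProblem("tensor_packing", pulp.LpMinimize)
peak = pulp.LpVariable("peak", lowBound=0)
offset = {t: pulp.LpVariable(f"offset_{t}", lowBound=0, cat="Integer")
          for t in tensors}

prob += peak  # objective: minimize the highest address in use

for t, (_, _, size) in tensors.items():
    prob += offset[t] + size <= peak  # every tensor must fit below the peak

names = list(tensors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sa, ea, size_a = tensors[a]
        sb, eb, size_b = tensors[b]
        if sa <= eb and sb <= ea:  # lifetimes overlap: addresses must not
            a_above_b = pulp.LpVariable(f"{a}_above_{b}", cat="Binary")
            prob += offset[a] + size_a <= offset[b] + BIG_M * a_above_b
            prob += offset[b] + size_b <= offset[a] + BIG_M * (1 - a_above_b)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(peak), {t: pulp.value(offset[t]) for t in tensors})  # peak of 90
```

With these values, A and C never coexist and can share addresses, so the solver packs them into the same region and reports a peak of 90 bytes instead of the 120 bytes a naive disjoint layout would need.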
3.1. DNN Representation
As mentioned in section 2.1, we model a neural network as a directed acyclic graph G = (V, E) with n nodes V = {v1, ..., vn} that represent the operators of the neural network, and m edges E = {e1, ..., em} that encode the tensors exchanged by operators. The size in bytes of the tensor represented by edge ei is denoted Si. The source vertex of edge e is denoted src(e), and the set of sink vertices of edge e is denoted snks(e). The set of edges in the fanout of a node v is denoted fo(v), while the set of edges in its fanin is denoted fi(v). We will also denote by fi(e) the set of edges in the fanin of the source vertex of e. We represent by sib(e) the siblings of an edge e, that is, the collection of edges that are driven by the same source vertex.
We model time as a discrete set of timesteps T = {t1, ..., tn}. A single operation typically runs per timestep, but it is possible for several operations to execute concurrently. Therefore, we need at most n timesteps to schedule a graph with n operations.
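For readers who prefer code to notation, the helpers below (an illustrative sketch of our own, not part of OLLA) implement src(e), snks(e), fo(v), fi(v), fi(e), and sib(e) over a graph stored as a plain dictionary mapping each edge name to its source node, sink nodes, and size Si in bytes; fi(e) is spelled fi_of_edge to avoid clashing with fi(v).

```python
# The graph is stored as: edge name -> (source node, list of sink nodes, size Si).
graph = {
    "e1": ("v1", ["v2", "v3"], 10_000_000),
    "e2": ("v1", ["v2"], 20_000_000),
    "e3": ("v2", ["v4"], 5_000_000),
}

def src(g, e):
    """Operator that produces tensor e."""
    return g[e][0]

def snks(g, e):
    """Operators that consume tensor e (an edge may have several sinks)."""
    return g[e][1]

def fo(g, v):
    """Edges in the fanout of node v, i.e. the tensors it produces."""
    return [e for e, (source, _, _) in g.items() if source == v]

def fi(g, v):
    """Edges in the fanin of node v, i.e. the tensors it consumes."""
    return [e for e, (_, sinks, _) in g.items() if v in sinks]

def fi_of_edge(g, e):
    """Edges in the fanin of the source vertex of e."""
    return fi(g, src(g, e))

def sib(g, e):
    """Siblings of e: the other edges driven by the same source vertex."""
    return [x for x in fo(g, src(g, e)) if x != e]

print(fo(graph, "v1"), fi(graph, "v2"), sib(graph, "e1"))
# ['e1', 'e2'] ['e1', 'e2'] ['e2']
```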