OLLA: Optimizing the Lifetime and Location of Arrays
to Reduce the Memory Usage of Neural Networks
Benoit Steiner*
(FAIR)
Mostafa Elhoushi
(Meta)
Jacob Kahn
(FAIR)
James Hegarty
(Meta)
Abstract
The size of deep neural networks has grown exponentially
in recent years. Unfortunately, hardware devices have not kept
pace with the rapidly increasing memory requirements. To
cope with this, researchers have turned to techniques such as
spilling and recomputation, which increase training time, or
reduced precision and model pruning, which can affect model
accuracy.
We present OLLA, an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks. Our method reduces the memory usage of existing neural networks, without needing any modification to the models or their training procedures.
We formulate the problem as a joint integer linear program
(ILP). We present several techniques to simplify the encoding
of the problem, and enable our approach to scale to the size
of state-of-the-art neural networks using an off-the-shelf ILP
solver. We experimentally demonstrate that OLLA takes only minutes, if not seconds, to enable the training of neural networks using one-third less memory on average. OLLA is available at https://github.com/facebookresearch/OLLA.
1. Introduction
Scale is a major force behind the accuracy improvements of machine-learning-based solutions [9], and both the depth and width of deep neural networks (DNN) are expanding exponentially [66] (Figure 1). This inflation in size increases the memory needed to store the weights of the neural network and the intermediate results (e.g., activations and gradients) generated during the training process. Compounding the problem, researchers are training neural networks on larger inputs, such as high-resolution images [22, 71], video [28], three-dimensional point clouds [14], and long natural language sequences [75, 15, 21], and they are using larger batch sizes to increase efficiency [69].
Unfortunately, due to the slowing of Moore's law, the memory capacity of hardware has only increased linearly over the last decade (Figure 2). Thus, the amount of memory available on the hardware used to train DNNs has not kept pace with the needs of deep learning. Furthermore, features powered by machine learning, such as automatic speech recognition [63] or keyboard suggestions [35], are being personalized by fine-tuning models on-device. This means that model training is increasingly being pushed to even more memory-constrained edge devices such as smartphones.
*Correspondence to benoitsteiner@meta.com
Figure 1: The number of deep neural network parameters has increased 100,000-fold over the last 10 years, starting to grow exponentially around 2016. The y-axis is plotted on a log scale.
As a result, memory is increasingly becoming a bottleneck that hinders progress, and researchers frequently mention memory scarcity as a limiting factor that impacts their work [47, 36, 12, 18, 15].
The research community has proposed several solutions to mitigate the problem. Data spilling [54] and recomputation of intermediate results [43] relieve memory pressure. Novel neural network architectures [40] use memory more sparingly. Strategies such as reduced-precision training [76, 78, 45] and weight pruning [56, 24, 37] decrease memory requirements. However, these all come at the cost of decreasing the accuracy of the model, increasing the time it takes to train it, or both [49, 7, 57].
Popular deep learning frameworks such as PyTorch [61] and TensorFlow [1] do not fully utilize the limited memory available. Similar to traditional dynamic memory allocators such as tcmalloc [32] and jemalloc [26], these frameworks maintain a pool of free blocks of memory at runtime. To serve a memory request, they look for a large enough block in the memory pool, or allocate one from physical memory if none is available. This results in memory fragmentation when free memory blocks do not exactly match the size of an allocation request, which occurs frequently.
Figure 2: The memory capacity of NVIDIA datacenter GPUs (in gigabytes) has only increased tenfold over the last decade, which has not kept pace with the rapidly increasing size of deep neural networks. The y-axis is plotted on a linear scale.
Furthermore, DNN frameworks do not optimize tensor lifetimes. PyTorch [61] executes operations in the order in which they are defined in the program. TensorFlow [1] keeps a queue of operators that are ready to run, and executes them on a first-come, first-served basis. As a result, tensors can be allocated earlier than required, or freed later than necessary, wasting valuable memory.
Our method overcomes these two limitations of existing deep learning frameworks. We model the computations performed to train a deep neural network as a dataflow graph of operations. We analyze this graph to find a topological ordering of the nodes that adjusts the lifetimes of the tensors generated by these operations to minimize the peak amount of memory that needs to be allocated (Figure 3). Furthermore, we find an optimal packing of these tensors, which minimizes memory fragmentation (Figure 4). We encode these two objectives as an integer linear program (ILP) that can be solved quickly by commodity solvers, and present OLLA (Optimization of the Lifetime and Location of Arrays), our algorithm for memory-optimal training of neural networks.

In addition to significantly reducing memory usage, our solution has four key strengths. First, it does not impact the accuracy of the predictions of the neural networks. Second, it requires no modification to the neural network or the training procedure. Third, it does not increase training time. Fourth, it is orthogonal to, and can be combined with, other memory reduction techniques to further reduce the memory needs of a neural network.
Our work makes the following novel contributions:
• We formulate the problem of finding the lifetimes and memory locations of tensors that minimize the peak memory required to train neural networks as a joint integer linear program.
• We demonstrate how to leverage domain knowledge to simplify the ILP formulation, which enables off-the-shelf solvers to quickly reduce the memory usage of large DNNs.
• We study empirically the practicality and effectiveness of our solution on a wide variety of DNNs, achieving average memory savings exceeding 30% in a median time of less than 10 seconds.
• We provide an open source implementation of OLLA, which is available at https://github.com/facebookresearch/OLLA.

[Figure 3 diagram omitted: a four-node dataflow graph with tensors e1:10MB, e2:10MB, e3:20MB, e4:30MB, e5:5MB, e6:10MB. Order #1 (v1, v2, v3, v4) has resident sets {e1,e2,e3}=40MB, {e2,e3,e5}=35MB, {e2,e4,e5}=45MB, {e4,e5,e6}=45MB, for a peak of 45MB. Order #2 (v1, v3, v2, v4) has resident sets {e1,e2,e3}=40MB, {e2,e3,e4}=60MB, {e3,e4,e5}=55MB, {e4,e5,e6}=45MB, for a peak of 60MB.]
Figure 3: Node execution orders can impact peak memory usage. Edges are annotated with the size of their corresponding tensors, and the two feasible node orders are annotated with the set of edges resident in memory at each step. Running v2 before v3 is significantly more memory efficient.
2. Background
2.1. Representing Neural Networks as Dataflow Graphs
Deep neural networks can be represented using dataflow graphs, as pioneered by TensorFlow [1]. The nodes of the graph encode the computations to be performed (e.g., matrix multiplications, convolutions, activation functions), while the edges represent the data (aka tensor or array) that is produced by an operation and transferred to consumer nodes.
Due to the producer-consumer relation between connected nodes, edges are oriented. Each edge has exactly one source, which is the operator that generated the corresponding tensor. Since a tensor can be consumed by more than one node, edges can have multiple sinks.
Operators can have multiple incoming (aka fanin) edges. Typically, one of these incoming edges will be the tensor generated by the previous layer, and another one will be a weight tensor. Similarly, operators can have multiple outgoing (aka fanout) edges: while most operations generate a single output tensor, some may create two or more. Operators with no fanout edges are used to model the final outputs of the neural network. Operators without fanin edges can model random number generators, constants, weights, or initial inputs to the neural network.
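To make this representation concrete, here is a minimal Python sketch of such a dataflow graph with single-source, multi-sink edges. It is our own illustration rather than code from OLLA; the Node, Edge, and connect names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Edge:
    """A tensor produced by one operator and consumed by zero or more others."""
    name: str
    size: int                                           # tensor size in bytes
    source: Optional["Node"] = None                     # exactly one producer
    sinks: List["Node"] = field(default_factory=list)   # possibly several consumers

@dataclass
class Node:
    """An operator (e.g., a matrix multiplication, convolution, or activation)."""
    name: str
    fanin: List[Edge] = field(default_factory=list)     # tensors it consumes
    fanout: List[Edge] = field(default_factory=list)    # tensors it produces

def connect(producer: Node, consumers: List[Node], name: str, size: int) -> Edge:
    """Create an edge carrying a `size`-byte tensor from `producer` to `consumers`."""
    e = Edge(name=name, size=size, source=producer, sinks=list(consumers))
    producer.fanout.append(e)
    for c in consumers:
        c.fanin.append(e)
    return e

# A node with two fanin edges: an activation from the previous layer and a weight.
prev_layer, weights, matmul = Node("prev_layer"), Node("weights"), Node("matmul")
connect(prev_layer, [matmul], "activation", 4 * 1024 * 1024)
connect(weights, [matmul], "weight", 2 * 1024 * 1024)
```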
In the remainder of this paper, we assume that the graphs are acyclic. In practice, this is not a significant limitation since recurrent neural networks such as LSTM [39] have been eclipsed by transformers [75]. Furthermore, their loops can be unrolled to avoid the problem altogether.
2.2. Optimizing Tensor Lifetimes
In order for an operator to run, all its input tensors must be resident in memory, and its output tensors must have been allocated so that they can be written to while the node executes. Additionally, to avoid recomputing tensors, once a tensor is generated it must be preserved in memory until all its consumers have been run.
We define the resident set RS(t) at a given time t as the set of tensors that need to be kept in memory at that point in the execution of the neural network. It comprises the tensors in the fanin and fanout of the operator that is scheduled for execution at timestep t, as well as all the other tensors that were previously generated but need to be kept in memory to be able to run subsequent operators. The peak resident set is the largest resident set over the execution of the network.
The order in which nodes are executed impacts the lifetimes of the tensors, and therefore the peak resident set. Figure 3 illustrates a simple example where changing the operator ordering noticeably improves memory usage.
Among all possible node orderings, those prioritizing the execution of nodes that free large amounts of data while generating little output data themselves are likely to be more efficient. However, as demonstrated in prior works [5, 8], finding an optimal schedule for a generic DAG is an NP-complete problem, which cannot be solved with a simple greedy approach.
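To make the effect illustrated in Figure 3 concrete, the sketch below (our own simplification, with the graph's connectivity reconstructed from the resident sets listed in the figure) computes the peak resident set of a given execution order; it reports 45MB for order #1 and 60MB for order #2.

```python
def peak_memory(order, edges):
    """Peak resident-set size for a given execution order.

    `edges` maps an edge name to (source node, sink nodes, size). A tensor is
    resident from the step at which its source runs (step 0 for external inputs)
    until the last step at which one of its sinks runs (the final step for
    network outputs).
    """
    step = {v: i for i, v in enumerate(order)}
    last = len(order) - 1

    def lifetime(source, sinks):
        start = step.get(source, 0)
        end = max((step[s] for s in sinks if s in step), default=last)
        return start, end

    peak = 0
    for i in range(len(order)):
        resident = sum(
            size for source, sinks, size in edges.values()
            if lifetime(source, sinks)[0] <= i <= lifetime(source, sinks)[1]
        )
        peak = max(peak, resident)
    return peak

# Figure 3's graph: edge -> (producer, consumers, size in MB).
edges = {
    "e1": ("input", ["v1"], 10),
    "e2": ("v1", ["v3"], 10),
    "e3": ("v1", ["v2"], 20),
    "e4": ("v3", ["v4"], 30),
    "e5": ("v2", ["v4"], 5),
    "e6": ("v4", ["output"], 10),
}
print(peak_memory(["v1", "v2", "v3", "v4"], edges))  # order #1 -> 45
print(peak_memory(["v1", "v3", "v2", "v4"], edges))  # order #2 -> 60
```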
2.3. Optimizing Tensor Locations in Memory
Similar to malloc-style memory allocators, the tensor allocation schemes used by typical deep learning frameworks operate online and suffer from fragmentation. Indeed, free memory is often segregated into small blocks interspersed with memory allocated to live tensors. As a result, a significant fraction of the total memory is effectively unusable because it is divided into pieces that are not large enough to fit a tensor. Figure 4 illustrates this phenomenon, and demonstrates that planning the location of each tensor ahead of time can significantly reduce the overall peak memory usage.
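The toy first-fit allocator below (our own sketch with made-up sizes, not the allocator of PyTorch or TensorFlow) reproduces the failure mode of Figure 4: after a tensor is freed, enough memory is free in total, yet an allocation fails because no contiguous gap is large enough, whereas a planned placement would have left room.

```python
def first_fit_offset(live, size, capacity):
    """Return the lowest offset at which a block of `size` bytes fits without
    overlapping any live allocation, or None if no gap is large enough.
    `live` is a list of (offset, size) pairs for currently allocated tensors."""
    candidate = 0
    for offset, allocated in sorted(live):
        if candidate + size <= offset:
            break                         # the gap before this block is big enough
        candidate = max(candidate, offset + allocated)
    return candidate if candidate + size <= capacity else None

capacity = 100
# Online, greedy placement: A (30 B) at offset 0, B (30 B) packed right after it
# at offset 30. A is later freed, leaving 70 B free split across two gaps.
live = [(30, 30)]                              # only B remains resident
print(first_fit_offset(live, 60, capacity))    # None: no contiguous 60 B gap
# Planned ahead of time: B is placed at offset 70 instead, so freeing A leaves
# one contiguous 70 B region and the 60 B tensor fits.
live = [(70, 30)]
print(first_fit_offset(live, 60, capacity))    # 0
```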
3. Formulation
We propose to take advantage of the predictability of neural network computations to proactively Optimize the Lifetime and Location of Arrays (OLLA).
[Figure 4 diagram omitted: two placements of tensors A, B, C, and D in the address space across the steps Input, Op 1, Op 2, Op 3, and Out; in the greedy layout (top) tensor C fails to fit, while in the planned layout (bottom) all tensors fit.]
Figure 4: Memory fragmentation can cause allocation to fail.
A greedy allocator (top) would not leave any room between
tensor A and B, thus making the space unusable for tensor C
once A is freed. OLLA (bottom) leaves a gap between tensor
A and B to enable the reuse of the memory freed by tensor A
and fits all the tensors in memory.
We formulate the problem of optimizing the ordering of computations (which determines the tensor lifetimes) and the location of tensors in memory (which determines the amount of memory fragmentation) for generic dataflow graphs, including those used in neural network training. We encode the problem as an integer linear program, and use an off-the-shelf ILP solver to find a solution that minimizes the peak memory required to run the dataflow graph.
We solve the ILP problem ahead of time, before the training process starts. This results in a small one-time initial cost, which can be recouped over the duration of the training: as free space does not need to be found at runtime, our memory allocation and deallocation operations are much cheaper than those of a standard allocator, thus saving some time at each training step (section 5.7).
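To give a flavor of what such an ahead-of-time encoding looks like, here is a toy memory-packing problem written with the off-the-shelf PuLP solver. It is a deliberately simplified sketch, not the joint formulation presented in this paper: the tensor lifetimes are taken as given instead of being optimized, and the names and sizes are made up. It only shows how non-overlap constraints and a peak-memory objective can be expressed as a (mixed-)integer linear program.

```python
# A toy memory-packing program using PuLP (pip install pulp).
import pulp

# tensor name -> (first step alive, last step alive, size in bytes); illustrative only.
tensors = {"A": (0, 1, 30), "B": (0, 3, 30), "C": (2, 3, 60)}
BIG_M = sum(size for _, _, size in tensors.values())

prob = pulp.LpProblem("tensor_packing", pulp.LpMinimize)
peak = pulp.LpVariable("peak", lowBound=0)
offset = {t: pulp.LpVariable(f"offset_{t}", lowBound=0, cat="Integer")
          for t in tensors}

prob += peak  # objective: minimize the highest address in use

for t, (_, _, size) in tensors.items():
    prob += offset[t] + size <= peak  # every tensor must fit below the peak

names = list(tensors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sa, ea, size_a = tensors[a]
        sb, eb, size_b = tensors[b]
        if sa <= eb and sb <= ea:  # lifetimes overlap: addresses must not
            a_above_b = pulp.LpVariable(f"{a}_above_{b}", cat="Binary")
            prob += offset[a] + size_a <= offset[b] + BIG_M * a_above_b
            prob += offset[b] + size_b <= offset[a] + BIG_M * (1 - a_above_b)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(peak), {t: pulp.value(offset[t]) for t in tensors})  # peak of 90
```

With these values, A and C never coexist and can share addresses, so the solver packs them into the same region and reports a peak of 90 bytes instead of the 120 bytes a naive disjoint layout would need.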
3.1. DNN Representation
As mentioned in section 2.1, we model a neural network as a directed acyclic graph G = (V, E) with n nodes V = {v1, ..., vn} that represent the operators of the neural network, and m edges E = {e1, ..., em} that encode the tensors exchanged by operators. The size in bytes of the tensor represented by edge ei is denoted Si. The source vertex of edge e is denoted src(e), and the set of sink vertices of edge e is denoted snks(e). The set of edges in the fanout of a node v is denoted fo(v), while the set of edges in its fanin is denoted fi(v). We will also denote by fi(e) the set of edges in the fanin of the source vertex of e. We represent by sib(e) the siblings of an edge e, that is, the collection of edges that are driven by the same source vertex.
We model time as a discrete set of timesteps T = {t1, ..., tn}. A single operation typically runs per timestep, but it is possible for several operations to execute concurrently. Therefore, we need at most n timesteps to schedule a graph with n operations.
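For readers who prefer code to notation, the helpers below (an illustrative sketch of our own, not part of OLLA) implement src(e), snks(e), fo(v), fi(v), fi(e), and sib(e) over a graph stored as a plain dictionary mapping each edge name to its source node, sink nodes, and size Si in bytes; fi(e) is spelled fi_of_edge to avoid clashing with fi(v).

```python
# The graph is stored as: edge name -> (source node, list of sink nodes, size Si).
graph = {
    "e1": ("v1", ["v2", "v3"], 10_000_000),
    "e2": ("v1", ["v2"], 20_000_000),
    "e3": ("v2", ["v4"], 5_000_000),
}

def src(g, e):
    """Operator that produces tensor e."""
    return g[e][0]

def snks(g, e):
    """Operators that consume tensor e (an edge may have several sinks)."""
    return g[e][1]

def fo(g, v):
    """Edges in the fanout of node v, i.e. the tensors it produces."""
    return [e for e, (source, _, _) in g.items() if source == v]

def fi(g, v):
    """Edges in the fanin of node v, i.e. the tensors it consumes."""
    return [e for e, (_, sinks, _) in g.items() if v in sinks]

def fi_of_edge(g, e):
    """Edges in the fanin of the source vertex of e."""
    return fi(g, src(g, e))

def sib(g, e):
    """Siblings of e: the other edges driven by the same source vertex."""
    return [x for x in fo(g, src(g, e)) if x != e]

print(fo(graph, "v1"), fi(graph, "v2"), sib(graph, "e1"))
# ['e1', 'e2'] ['e1', 'e2'] ['e2']
```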