
number generators, constants, weights, or initial inputs to the
neural network.
In the remainder of this paper, we assume that the graphs are acyclic. In practice, this is not a significant limitation since recurrent neural networks such as LSTMs [39] have been eclipsed by transformers [75]. Furthermore, their loops can be unrolled to avoid the problem altogether.
2.2. Optimizing Tensor Lifetimes
In order for an operator to run, all its input tensors must be resident in memory, and its output tensors must have been allocated so that they can be written to while the node executes. Additionally, to avoid recomputing tensors, once a tensor is generated it must be preserved in memory until all its consumers have been run.
We define the resident set $RS(t)$ at a given time $t$ as the set of tensors that need to be kept in memory at that point in the execution of the neural network. It comprises the tensors in the fanin and fanout of the operator that is scheduled for execution at timestep $t$, as well as all the other tensors that were previously generated but need to be kept in memory to be able to run subsequent operators. The peak resident set is the largest resident set over the execution of the network.
The order in which nodes are executed impacts the lifetime of the tensors, and therefore the peak resident set. Figure 3 illustrates a simple example where changing the operator ordering noticeably improves memory usage.
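To make the effect of ordering concrete, the following sketch (illustrative names and data structures, not OLLA's implementation) measures the peak resident set of a candidate execution order, assuming one operator runs per timestep; evaluating it on two different orderings of the same graph exposes the memory impact of the scheduling decision.

```python
def peak_resident_set(order, inputs, outputs, sizes):
    """order: operators in execution order; inputs/outputs: operator -> list of
    tensor names; sizes: tensor name -> size in bytes."""
    # Last timestep at which each tensor is consumed.
    last_use = {}
    for step, op in enumerate(order):
        for t in inputs[op]:
            last_use[t] = step

    live, peak = set(), 0
    for step, op in enumerate(order):
        # Inputs must be resident and outputs allocated while the operator runs
        # (graph inputs such as weights are counted from their first use here).
        live.update(inputs[op])
        live.update(outputs[op])
        peak = max(peak, sum(sizes[t] for t in live))
        # Free every tensor whose last consumer has now executed.
        live = {t for t in live if last_use.get(t, step) > step}
    return peak
```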
Among all possible node orderings, those that prioritize the execution of nodes that free large amounts of data while generating little output data themselves are likely to be more efficient. However, as demonstrated in prior works [5, 8], finding an optimal schedule for a generic DAG is an NP-complete problem, which cannot be solved with a simple greedy approach.
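To illustrate what such a heuristic looks like (a sketch only, with hypothetical helper structures, and not the approach OLLA takes), a greedy scheduler can repeatedly pick, among the operators whose inputs are available, the one with the best net memory effect:

```python
def greedy_order(ops, inputs, outputs, sizes):
    """ops: list of operators; inputs/outputs: operator -> list of tensor names;
    sizes: tensor name -> bytes. Returns one topological order of the DAG."""
    producer = {t: op for op in ops for t in outputs[op]}
    remaining_consumers = {}
    for op in ops:
        for t in inputs[op]:
            remaining_consumers[t] = remaining_consumers.get(t, 0) + 1

    done, order = set(), []
    while len(order) < len(ops):
        # Operators whose inputs have all been produced (or are graph inputs).
        ready = [op for op in ops if op not in done
                 and all(t not in producer or producer[t] in done
                         for t in inputs[op])]

        def net_freed(op):
            # Bytes freed (inputs whose last remaining consumer is this operator)
            # minus bytes newly allocated for the operator's outputs.
            freed = sum(sizes[t] for t in inputs[op]
                        if remaining_consumers[t] == 1)
            allocated = sum(sizes[t] for t in outputs[op])
            return freed - allocated

        op = max(ready, key=net_freed)
        order.append(op)
        done.add(op)
        for t in inputs[op]:
            remaining_consumers[t] -= 1
    return order
```

As noted above, such one-step lookahead is not sufficient to find an optimal schedule in general, which is what motivates the ILP formulation of section 3.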
2.3. Optimizing Tensor Locations in Memory
Similar to malloc-style memory allocators, the tensor allocation schemes used by typical deep learning frameworks operate online and suffer from fragmentation: free memory is often segregated into small blocks interspersed with memory allocated to live tensors. As a result, a significant fraction of the total memory is effectively unusable because it is divided into pieces that are not large enough to fit a tensor. Figure 4 illustrates this phenomenon, and demonstrates that planning the location of each tensor ahead of time can significantly reduce the overall peak memory usage.
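The effect can be reproduced with a toy first-fit allocator (the sizes below are made up to mirror Figure 4; this is not how any particular framework's allocator is implemented):

```python
# An online first-fit allocator packs A and B back to back; once A is freed, the
# free space is split into two non-adjacent blocks and C no longer fits, even
# though enough memory is free in total.
free_blocks = [(0, 10)]                 # (offset, length) blocks in a 10-byte arena

def first_fit(size):
    for i, (off, length) in enumerate(free_blocks):
        if length >= size:
            free_blocks[i] = (off + size, length - size)
            return off
    return None                         # allocation failure

def release(off, size):
    free_blocks.append((off, size))     # coalescing would not help: the blocks are not adjacent

a = first_fit(4)                        # tensor A -> offset 0
b = first_fit(4)                        # tensor B -> offset 4
release(a, 4)                           # A is dead once its last consumer has run
print(first_fit(6))                     # tensor C (6 bytes) -> None: 6 bytes free, but fragmented

# A planned layout would instead place B at offset 6, keeping [0, 6) contiguous
# so that C can reuse the space freed by A within the same 10-byte budget.
```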
3. Formulation
We propose to take advantage of the predictability of neural network computations to proactively Optimize the Lifetime and Location of Arrays (OLLA).
Figure 4: Memory fragmentation can cause allocation to fail. A greedy allocator (top) would not leave any room between tensors A and B, making the space unusable for tensor C once A is freed. OLLA (bottom) leaves a gap between tensors A and B to enable the reuse of the memory freed by tensor A, and fits all the tensors in memory.
We formulate the problem of jointly optimizing the ordering of computations (which determines the tensor lifetimes) and the location of tensors in memory (which determines the amount of memory fragmentation) for generic dataflow graphs, including those used in neural network training. We encode the problem as an integer linear program, and use an off-the-shelf ILP solver to find a solution that minimizes the peak memory required to run the dataflow graph.
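To give a flavor of such a formulation, the following is a heavily simplified sketch written with the open-source PuLP modeling library: it only optimizes operator ordering (tensor lifetimes), assumes each tensor has a single consumer, and ignores address assignment and the refinements of OLLA's actual ILP described in the remainder of this section.

```python
import pulp

def min_peak_schedule(ops, edges, sizes, timesteps):
    """ops: operator names; edges: list of (producer, consumer, tensor) triples,
    one consumer per tensor in this sketch; sizes: tensor name -> bytes."""
    prob = pulp.LpProblem("min_peak_memory", pulp.LpMinimize)
    T = range(timesteps)
    run = {(v, t): pulp.LpVariable(f"run_{v}_{t}", cat="Binary")
           for v in ops for t in T}
    live = {(i, t): pulp.LpVariable(f"live_{i}_{t}", cat="Binary")
            for i, _ in enumerate(edges) for t in T}
    peak = pulp.LpVariable("peak", lowBound=0)

    prob += peak                                    # objective: minimize the peak resident set
    for v in ops:                                   # every operator runs exactly once
        prob += pulp.lpSum(run[v, t] for t in T) == 1
    for i, (u, w, _) in enumerate(edges):
        for t in T:
            # a consumer may only run once its producer has run
            prob += run[w, t] <= pulp.lpSum(run[u, s] for s in range(t))
            # the tensor is live whenever it has been produced and is still needed
            produced = pulp.lpSum(run[u, s] for s in range(t + 1))
            needed = pulp.lpSum(run[w, s] for s in range(t, timesteps))
            prob += live[i, t] >= produced + needed - 1
    for t in T:                                     # the peak bounds every timestep's resident set
        prob += peak >= pulp.lpSum(sizes[name] * live[i, t]
                                   for i, (_, _, name) in enumerate(edges))

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    step = {v: next(t for t in T if run[v, t].value() > 0.5) for v in ops}
    return sorted(ops, key=step.get), pulp.value(peak)
```

Setting timesteps equal to the number of operators always admits a feasible schedule; tightening this bound shrinks the ILP considerably.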
We solve the ILP problem ahead of time, before the training process starts. This results in a small one-time upfront cost, which is recouped over the course of training: since free space does not need to be found at runtime, our memory allocation and deallocation operations are much cheaper than those of a standard allocator, saving time at each training step (section 5.7).
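As an illustration of why the runtime cost is low (a sketch under the assumption of a NumPy-like flat buffer; all names are hypothetical), every tensor reduces to a precomputed offset into a single preallocated arena:

```python
import numpy as np

class PlannedArena:
    """Sketch: tensor addresses were planned offline, so runtime 'allocation'
    is a constant-time view into one preallocated buffer and 'freeing' is a no-op."""

    def __init__(self, plan, capacity_bytes):
        # plan: tensor name -> (offset in bytes, shape, dtype), produced ahead of time
        self.plan = plan
        self.arena = np.zeros(capacity_bytes, dtype=np.uint8)

    def tensor(self, name):
        offset, shape, dtype = self.plan[name]
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        return self.arena[offset:offset + nbytes].view(dtype).reshape(shape)

# Hypothetical plan for two 4-element fp32 tensors sharing one 32-byte arena.
plan = {"A": (0, (4,), np.float32), "B": (16, (4,), np.float32)}
arena = PlannedArena(plan, capacity_bytes=32)
a = arena.tensor("A")   # no free-list search, no bookkeeping when the tensor dies
```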
3.1. DNN Representation
As mentioned in section 2.1, we model a neural network as a directed acyclic graph $G = (V, E)$ with $n$ nodes $V = \{v_1, \ldots, v_n\}$ that represent the operators of the neural network, and $m$ edges $E = \{e_1, \ldots, e_m\}$ that encode the tensors exchanged by operators. The size in bytes of the tensor represented by edge $e_i$ is denoted $S_i$. The source vertex of edge $e$ is denoted $src(e)$, and the set of sink vertices of edge $e$ is denoted $snks(e)$. The set of edges in the fanout of a node $v$ is denoted $fo(v)$, while the set of edges in its fanin is represented as $fi(v)$. We will also denote by $fi(e)$ the set of edges in the fanin of the source vertex of $e$. We represent by $sib(e)$ the siblings of an edge $e$, that is, the collection of edges that are driven by the same source vertex.
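The notation above can be summarized with a small illustrative data structure (not OLLA's internal representation):

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    name: str                                     # e_i
    size: int                                     # S_i, in bytes
    source: str                                   # src(e): the producing operator
    sinks: list = field(default_factory=list)     # snks(e): the consuming operators

@dataclass
class Graph:
    edges: dict = field(default_factory=dict)     # tensor name -> Edge

    def fo(self, v):                              # fo(v): edges driven by node v
        return [e for e in self.edges.values() if e.source == v]

    def fi(self, v):                              # fi(v): edges consumed by node v
        return [e for e in self.edges.values() if v in e.sinks]

    def fi_of_edge(self, e):                      # fi(e): fanin of the source vertex of e
        return self.fi(e.source)

    def sib(self, e):                             # sib(e): edges driven by the same source
        return [s for s in self.fo(e.source) if s.name != e.name]
```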
We model time as a discrete set of timesteps $T = \{t_1, \ldots, t_n\}$. A single operation typically runs per timestep, but it is possible for several operations to execute concurrently. Therefore, we need at most $n$ timesteps to schedule a graph with $n$ operations.