In contrast to eMMC storage, which typically provides 300 MB/s of sequential read performance, M.2 NVMe SSDs provide up to 7000 MB/s [17]. Although random reads are somewhat slower, most DNN model files exhibit sequential access patterns due to the inherent sequential nature of DNN execution [10], [11].
Even with the fastest SSD, extra delays are still significant.
Thus, our next approach is to hide the delays as much as possible through pipelined execution of read, copy, and kernel operations. Fortunately, even in iGPU systems, these three operations can run in parallel, because read operations can be carried out by the CPU while copy and kernel operations are being processed by the GPU. Even better, Nvidia GPUs have two
separate processing units: a copy engine (CE) and an execution
engine (EE). The CE can process copy operations while the EE
is executing GPU kernels [18], [19]. As a result, read, copy,
and kernel operations can run fully in parallel. Based on this
parallel hardware architecture, we developed and evaluated a
number of software pipeline architectures on an Nvidia Jetson
AGX Xavier platform with various DNNs. The remainder of this section presents the case of YOLOv4 [20] in particular, whose model size is 245.8 MB and whose average inference delay in the preloading architecture is 160.8 ms. In addition, its largest layer is 18.0 MB.
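As a concrete illustration of the underlying hardware overlap, the following sketch issues an asynchronous host-to-device copy on one CUDA stream while a kernel runs on another, leaving the CPU free to read the next layer from storage in the meantime. The kernel, buffer names, and sizes are placeholders for illustration, not our actual implementation.

#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for a DNN layer kernel (illustrative only).
__global__ void dummy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_buf, *d_a, *d_b;
    cudaMallocHost((void **)&h_buf, n * sizeof(float)); // pinned memory enables async copies
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));

    cudaStream_t copy_stream, exec_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);

    // The copy engine (CE) services the async copy on copy_stream while the
    // execution engine (EE) runs the kernel on exec_stream; meanwhile, the
    // CPU is free to read the next layer from the SSD.
    cudaMemcpyAsync(d_b, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, copy_stream);
    dummy_kernel<<<(n + 255) / 256, 256, 0, exec_stream>>>(d_a, n);
    // ... a CPU-side read() of the next layer would be issued here ...

    cudaStreamSynchronize(copy_stream);
    cudaStreamSynchronize(exec_stream);
    printf("read, copy, and kernel can proceed concurrently\n");

    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(exec_stream);
    cudaFreeHost(h_buf);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}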
Synchronous pipeline. In the 3-stage synchronous pipeline architecture, the read, copy, and kernel stages advance in lockstep with a common pipeline cycle. Since kernel
operations are usually the longest among the three stages, most
read and copy operations are hidden behind kernel operations.
This pipeline architecture needs two inter-stage buffers: (i) a
CPU memory buffer between read and copy stages and (ii) a
GPU memory buffer between copy and kernel stages. Since
each buffer needs to hold just the layer being processed, the
required buffer size is the size of the largest layer. In addition, the buffers should be double-buffered because, for example, a read into the CPU buffer can occur simultaneously with a copy out of the same buffer. The same applies to the GPU buffer.
Our implementation provides an 85.4% memory reduction (to
72.0 MB) with 23.7% delay overhead (to 198.9 ms).
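A minimal sketch of this synchronous pipeline is shown below, assuming hypothetical helpers read_layer() and launch_layer_kernel() that encapsulate the read and kernel stages; it is meant to convey the cycle structure and the double buffering, not the details of our implementation.

#include <cuda_runtime.h>
#include <cstddef>

// Assumed (hypothetical) helpers: read one layer's parameters from the model
// file into a host buffer, returning its size, and launch the kernel(s) of
// one layer using parameters already resident in a device buffer.
extern size_t read_layer(int layer_idx, char *host_buf);
extern void launch_layer_kernel(int layer_idx, char *dev_buf, cudaStream_t s);

void infer_sync_pipeline(int num_layers, size_t max_layer_size) {
    char *h_buf[2], *d_buf[2];                      // double-buffered inter-stage buffers
    for (int b = 0; b < 2; ++b) {
        cudaMallocHost((void **)&h_buf[b], max_layer_size); // read -> copy (pinned CPU)
        cudaMalloc((void **)&d_buf[b], max_layer_size);     // copy -> kernel (GPU)
    }
    cudaStream_t copy_stream, exec_stream;          // CE and EE work queues
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);

    size_t layer_bytes[2] = {0, 0};
    // Pipeline cycle c: execute layer c-2, copy layer c-1, read layer c.
    for (int c = 0; c < num_layers + 2; ++c) {
        if (c >= 2)                                 // kernel stage (EE)
            launch_layer_kernel(c - 2, d_buf[c % 2], exec_stream);
        if (c >= 1 && c - 1 < num_layers)           // copy stage (CE)
            cudaMemcpyAsync(d_buf[(c - 1) % 2], h_buf[(c - 1) % 2],
                            layer_bytes[(c - 1) % 2],
                            cudaMemcpyHostToDevice, copy_stream);
        if (c < num_layers)                         // read stage (CPU, blocking)
            layer_bytes[c % 2] = read_layer(c, h_buf[c % 2]);
        cudaStreamSynchronize(copy_stream);         // all stages advance together,
        cudaStreamSynchronize(exec_stream);         // i.e., a common pipeline cycle
    }

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(h_buf[b]);
        cudaFree(d_buf[b]);
    }
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(exec_stream);
}

Because every stage waits at the cycle boundary, the cycle time is dictated by the slowest stage in that cycle, which is exactly the source of the overhead discussed next.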
Asynchronous pipeline. If a read operation happens to be
the longest in a synchronous pipeline cycle, it causes a GPU
idling interval, negatively impacting the delay. To minimize such unwanted delays, we modify the architecture into an asynchronous pipeline, where the pipeline stages advance at their own pace [21]. Between the pipeline stages, we introduce two circular buffers, each just large enough to hold the largest layer, instead of the two pairs of double buffers used in the synchronous architecture, cutting the memory requirement in
half. Our implementation provides a 92.7% memory reduction
(to 36.0 MB) with 12.7% delay overhead (to 181.2 ms).
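The decoupling between stages can be realized with a single-producer/single-consumer circular buffer per stage boundary. The following sketch is a simplified byte-granular ring buffer of our own (the class name and blocking-by-yielding policy are illustrative choices, not our exact data structure); the instance between the read and copy stages would be backed by pinned CPU memory, and the one between the copy and kernel stages by GPU memory, with cudaMemcpyAsync acting as the consumer of the first and the producer of the second.

#include <atomic>
#include <cstddef>
#include <thread>

class RingBuffer {
public:
    // `storage` is an externally allocated region (e.g., pinned CPU memory)
    // whose capacity equals the largest layer size, so any single layer fits.
    RingBuffer(char *storage, size_t capacity)
        : buf_(storage), cap_(capacity), head_(0), tail_(0) {}

    // Producer side: block until `len` bytes fit, then append them.
    void push(const char *src, size_t len) {
        while (cap_ - (head_.load() - tail_.load(std::memory_order_acquire)) < len)
            std::this_thread::yield();           // buffer full: producer stalls
        size_t h = head_.load();
        for (size_t i = 0; i < len; ++i)         // byte-wise copy handles wrap-around
            buf_[(h + i) % cap_] = src[i];
        head_.store(h + len, std::memory_order_release);
    }

    // Consumer side: block until `len` bytes are available, then remove them.
    void pop(char *dst, size_t len) {
        while (head_.load(std::memory_order_acquire) - tail_.load() < len)
            std::this_thread::yield();           // buffer empty: consumer stalls
        size_t t = tail_.load();
        for (size_t i = 0; i < len; ++i)
            dst[i] = buf_[(t + i) % cap_];
        tail_.store(t + len, std::memory_order_release);
    }

private:
    char *buf_;
    size_t cap_;
    std::atomic<size_t> head_, tail_;   // monotonically increasing byte counters
};

In this sketch, the read thread pushes bytes as they arrive from the SSD while the downstream stage pops a layer as soon as it is complete, so each stage advances at its own pace whenever buffer space or data is available.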
Two-stage pipeline. Recent iGPU-based systems on chip (SoCs) (e.g., Nvidia Xavier) provide a special memory management scheme that allows a memory buffer to be accessed from both the CPU and the GPU [16]. This zero-copy memory eliminates the need for copy operations, enabling a 2-stage pipeline. With this architecture, the memory requirement is further reduced to the order of a single layer. Our implementation provides
a 96.3% memory reduction (to 18.0 MB) with 21.5% delay
overhead (to 195.3 ms).
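A minimal sketch of the zero-copy allocation that makes the 2-stage pipeline possible is given below, using the standard CUDA mapped-memory API; the kernel and buffer size are placeholders, not our actual implementation.

#include <cuda_runtime.h>

// Placeholder for a DNN layer kernel that consumes the loaded parameters.
__global__ void layer_kernel(const float *weights, int n) {
    // ... real layer computation would go here ...
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);          // allow mapped (zero-copy) memory

    const size_t bytes = 18 * 1024 * 1024;          // on the order of the largest layer
    float *h_weights, *d_weights;

    // Pinned, mapped allocation: on an iGPU, CPU and GPU share the same
    // physical memory, so no host-to-device copy is needed.
    cudaHostAlloc((void **)&h_weights, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_weights, h_weights, 0);

    // Read stage: the CPU reads a layer from the SSD directly into h_weights.
    // ... read(fd, h_weights, layer_size) ...

    // Kernel stage: the GPU consumes the very same memory through d_weights.
    layer_kernel<<<128, 256>>>(d_weights, (int)(bytes / sizeof(float)));
    cudaDeviceSynchronize();

    cudaFreeHost(h_weights);
    return 0;
}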
Memory-delay tradeoff. In the asynchronous pipeline architectures, we can intentionally increase the circular buffer size to exploit the tradeoff between memory and delay. Thus, we can devise an iterative optimization process that gradually increases the buffer size until there is no further delay reduction. With this optimization method, we can find the minimal-delay configuration. As a result, near-zero (<1.0 ms)
delay overhead is achieved by a slight increase in memory
usage (from 18.0 MB to 52.8 MB).
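The search itself can be a simple hill-climbing loop, sketched below under the assumption of a hypothetical measure_inference_delay() helper that runs the asynchronous pipeline with a given circular-buffer size and reports the average inference delay.

#include <cstddef>

// Assumed (hypothetical) helper: run the pipeline with the given buffer size
// and return the measured average inference delay in milliseconds.
extern double measure_inference_delay(size_t buffer_bytes);

size_t find_min_delay_buffer(size_t largest_layer_bytes, size_t step_bytes) {
    size_t size = largest_layer_bytes;              // minimum feasible buffer size
    double best = measure_inference_delay(size);
    for (;;) {
        double d = measure_inference_delay(size + step_bytes);
        if (d >= best)                              // no further delay reduction
            break;
        best = d;
        size += step_bytes;                         // trade memory for delay
    }
    return size;                                    // minimal-delay configuration
}

In practice, a small threshold can be used in the termination test to avoid reacting to measurement noise.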
The contributions of this study can be summarized as:
• We propose Demand Layering for minimized memory usage in DNN inference systems by loading and executing models in a layer-by-layer manner.
• Three pipeline architectures are presented that minimize the extra delay overhead of Demand Layering.
• The pipeline architectures are implemented and evaluated on Nvidia Jetson AGX Xavier, showing significant memory reductions with near-zero delay overhead.
II. PRELIMINARIES
A. Deep Neural Networks (DNNs)
In contrast to conventional programs, which are sequences
of instructions, DNNs are sequences of parameters, organized
by layers such as convolutional and fully connected layers.
The parameters are produced in a training phase and stored
in a DNN model file, whose format depends on the DNN framework in use. For example, Darknet [22] uses .weights binary files. PyTorch [23] uses .pt or .pth files, which are binary files serialized by the Python pickle module. TensorFlow [24] uses .pb files, which are binary files in the ProtoBuf format.
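For illustration, the sketch below reads such a model file as a short header followed by a flat sequence of 32-bit float parameters; the header layout and file name are assumptions made for the example rather than an exact specification of the Darknet format.

#include <cstdio>
#include <cstdint>
#include <vector>

int main() {
    FILE *f = fopen("yolov4.weights", "rb");        // hypothetical file path
    if (!f) return 1;

    // Assumed header layout: a version triple and a training-progress counter.
    int32_t major, minor, revision;
    fread(&major, sizeof(major), 1, f);
    fread(&minor, sizeof(minor), 1, f);
    fread(&revision, sizeof(revision), 1, f);
    uint64_t images_seen;
    fread(&images_seen, sizeof(images_seen), 1, f);

    // The rest of the file is a flat sequence of float parameters; a framework
    // assigns them to layers in order (e.g., biases then weights per layer).
    std::vector<float> params(1024);
    size_t n = fread(params.data(), sizeof(float), params.size(), f);
    printf("read %zu parameters from a %d.%d.%d model\n", n, major, minor, revision);

    fclose(f);
    return 0;
}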
Regardless of the file format, the model files must be loaded into GPU memory in the initialization phase. Then, in
the inference phase, the preloaded parameters are interpreted
and executed by a DNN inference framework in a layer-by-
layer manner [9], [10]. This preloading architecture inherently imposes a significant GPU memory burden for storing the entire set of model parameters, which is especially serious in multi-DNN systems.
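A minimal sketch of this preloading step is shown below (the file name is illustrative and error handling is omitted): the whole parameter file is staged in CPU memory and then copied into a GPU buffer of the same size before any inference runs.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    FILE *f = fopen("yolov4.weights", "rb");        // e.g., a Darknet model file
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    size_t model_bytes = (size_t)ftell(f);          // entire model (245.8 MB for YOLOv4)
    fseek(f, 0, SEEK_SET);

    char *h_model = (char *)malloc(model_bytes);    // CPU-side staging buffer
    fread(h_model, 1, model_bytes, f);
    fclose(f);

    char *d_model;
    cudaMalloc((void **)&d_model, model_bytes);     // GPU memory holds the full model
    cudaMemcpy(d_model, h_model, model_bytes, cudaMemcpyHostToDevice);
    free(h_model);

    // The inference phase then interprets d_model layer by layer; the full
    // model stays resident in GPU memory for the lifetime of the application.

    cudaFree(d_model);
    return 0;
}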
B. Integrated CPU-GPU Systems
When designing embedded systems for DNN applications,
iGPUs are highly preferred to dGPUs due to their advantages in size, weight, and power (SWaP) [16]. In contrast to dGPUs, iGPUs share the same physical memory space with the CPU. In such systems, GPU memory optimization at the expense of CPU memory does not yield a net benefit.
Instead, a holistic CPU-GPU memory optimization method is
required.
A typical example of integrated CPU-GPU systems is
Nvidia Jetson AGX Xavier, which is our experimental plat-
form. Fig. 2 shows its internal architecture with 16 GB shared
DRAM, an 8-core 64-bit ARM CPU, and a 512-core integrated