Demand Layering for Real-Time DNN Inference
with Minimized Memory Usage
Mingoo Ji1,2, Saehanseul Yi3, Changjin Koo1, Sol Ahn1, Dongjoo Seo3, Nikil Dutt3, and Jong-Chan Kim1,4
1Graduate School of Automotive Engineering, Kookmin University, Korea
2Automotive Electronics Advanced Development TFT, Hyundai Mobis, Korea
3Department of Computer Science, University of California, Irvine, USA
4Department of Automobile and IT Convergence, Kookmin University, Korea
Correspondence: jongchank@kookmin.ac.kr
Abstract—When executing a deep neural network (DNN), its
model parameters are loaded into GPU memory before execution,
incurring a significant GPU memory burden. There are studies
that reduce GPU memory usage by exploiting CPU memory as
a swap device. However, this approach is not applicable in most
embedded systems with integrated GPUs where CPU and GPU
share a common memory. In this regard, we present Demand
Layering, which employs a fast solid-state drive (SSD) as a
co-running partner of a GPU and exploits the layer-by-layer
execution of DNNs. In our approach, a DNN is loaded and
executed in a layer-by-layer manner, minimizing the memory
usage to the order of a single layer. Also, we developed a
pipeline architecture that hides most additional delays caused by
the interleaved parameter loadings alongside layer executions.
Our implementation shows a 96.5% memory reduction with
just 14.8% delay overhead on average for representative DNNs.
Furthermore, by exploiting the memory-delay tradeoff, near-zero
delay overhead (under 1 ms) can be achieved with a slightly
increased memory usage (still an 88.4% reduction), showing the
great potential of Demand Layering.
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
I. INTRODUCTION
To enable efficient deep neural network (DNN) inference
with low-cost embedded hardware, its memory requirement
should be minimized. For that, a typical approach is to
apply pruning and quantization [1]–[3] that reduce the number
of model parameters, however, at the cost of unavoidable
accuracy loss. Once a model is fixed, all the parameters are
loaded into system memory before execution. To the best
of our knowledge, most state-of-the-art DNN frameworks
employ this method despite its excessive memory usage. How-
ever, in the era of large-scale models [4]–[8] and concurrent
DNNs [9]–[11], we argue that this naïve approach is no longer
viable, and thus a new system approach is needed that can
alleviate this excessive memory requirement.
Recent studies try to reduce the memory usage of DNN
inference by efficiently managing activation buffers between
DNN layers [12]–[14]. However, they are not applicable to
storing model parameters. Besides, SwapAdvisor [15] provides
a general method by utilizing inexpensive CPU memory
as a swap device of scarce GPU memory. This method is
promising in discrete GPU (dGPU) systems with separate
CPU and GPU memory. However, most embedded systems
use integrated GPUs (iGPUs), where CPU and GPU share
Fig. 1: Preloading vs. Demand Layering.
a common memory system [16]. In such systems, reducing
GPU memory at the cost of increased CPU memory does not
provide any benefit.
With this motivation, this study aims to reduce the mem-
ory usage of iGPU-based DNN inference systems, explicitly
targeting the memory for model parameters. Our idea is
to borrow the concept of demand paging in conventional
operating systems, where program instructions are loaded to
CPU memory on demand in the granularity of pages (typically
sized 4 KB - 16 KB). Similarly, exploiting the layer-by-layer
execution of DNNs, we propose Demand Layering that loads
model parameters on demand in the granularity of layers while
dropping previous layers of no use. In this manner, the memory
requirement is significantly reduced to the order of a single
layer from the order of the entire model. Fig. 1 highlights
the difference between the preloading architecture and our
Demand Layering.
However, the memory reduction is not free. It comes at
the cost of increased delays. Thus, we conducted a thorough
delay analysis, which shows that the inference delay can be decomposed into the following three operations:
• Read: Model parameters are read into CPU memory.
• Copy: Model parameters are copied to GPU memory.
• Kernel: DNN layers are executed by GPU kernels.
In the preloading architecture, all the read and copy oper-
ations are only in the initialization phase; thus, its inference
delay is just the sum of GPU kernel executions. In contrast,
Demand Layering repeatedly conducts read and copy oper-
ations during the inference phase, which potentially causes
extra delays. For that, our baseline approach is to employ
a high-performance solid-state drive (SSD). Compared with
eMMC storage, which typically offers around 300 MB/s of sequential read
performance, M.2 NVMe SSDs provide up to 7000 MB/s of
sequential read performance [17]. Although random reads are
somewhat slower, most DNN model files exhibit sequential
access patterns due to the inherent sequential nature of DNN
executions [10], [11].
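For illustration, the following CUDA C++ sketch shows an unpipelined form of this loop: each layer's parameters are read from the model file, copied to GPU memory, and executed before the next layer is touched, so memory usage stays in the order of a single layer. The LayerInfo table and the layer_kernel stub are hypothetical placeholders, not part of our actual implementation.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-layer metadata: byte offset and size of the layer's
// parameters within the model file.
struct LayerInfo { long offset; size_t bytes; };

// Stand-in for a real layer kernel (convolution, fully connected, ...).
__global__ void layer_kernel(const float* weights, size_t n) { /* ... */ }

void infer_unpipelined(const char* model_path,
                       const std::vector<LayerInfo>& layers,
                       size_t max_layer_bytes) {
  FILE* fp = std::fopen(model_path, "rb");
  float* h_buf = nullptr;                            // CPU-side staging buffer
  float* d_buf = nullptr;                            // GPU-side parameter buffer
  cudaMallocHost((void**)&h_buf, max_layer_bytes);   // pinned host memory
  cudaMalloc((void**)&d_buf, max_layer_bytes);

  for (const LayerInfo& l : layers) {
    // Read: sequential read of this layer's parameters from the SSD.
    std::fseek(fp, l.offset, SEEK_SET);
    std::fread(h_buf, 1, l.bytes, fp);
    // Copy: move the parameters into the GPU buffer.
    cudaMemcpy(d_buf, h_buf, l.bytes, cudaMemcpyHostToDevice);
    // Kernel: execute the layer; the previous layer's parameters have
    // already been overwritten, so only one layer is ever resident.
    layer_kernel<<<256, 256>>>(d_buf, l.bytes / sizeof(float));
    cudaDeviceSynchronize();
  }
  std::fclose(fp);
  cudaFreeHost(h_buf);
  cudaFree(d_buf);
}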
Even with the fastest SSD, extra delays are still significant.
Thus, our next approach is to hide away the delays as much
as possible by pipelined execution of read, copy, and kernel
operations. Fortunately, even in iGPU systems, these three
operations can run in parallel, because read operations can
be carried out by CPU while copy and kernel operations are
being processed by GPU. Even better, Nvidia GPUs have two
separate processing units: a copy engine (CE) and an execution
engine (EE). The CE can process copy operations while the EE
is executing GPU kernels [18], [19]. As a result, read, copy,
and kernel operations can run fully in parallel. Based on this
parallel hardware architecture, we developed and evaluated a
number of software pipeline architectures on an Nvidia Jetson
AGX Xavier platform with various DNNs. The remainder
of this section introduces the case with YOLOv4 [20] in
particular, whose model size is 245.8 MB and whose average
inference delay in the preloading architecture is 160.8 ms.
Besides, the largest layer size is 18.0 MB.
Synchronous pipeline. In the 3-stage synchronous pipeline
architecture, its read, copy, and kernel stages advance while
synchronized with a common pipeline cycle. Since kernel
operations are usually the longest among the three stages, most
read and copy operations are hidden behind kernel operations.
This pipeline architecture needs two inter-stage buffers: (i) a
CPU memory buffer between read and copy stages and (ii) a
GPU memory buffer between copy and kernel stages. Since
each buffer needs to hold just the layer being processed, the
required buffer size is the size of the largest layer. In addition,
the buffers should be double-buffered because, for example, a
read to the CPU buffer can happen simultaneously with a copy
from the same buffer. The same applies to the GPU buffer.
Our implementation provides an 85.4% memory reduction (to
72.0 MB) with 23.7% delay overhead (to 198.9 ms).
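For illustration, the sketch below organizes one possible 3-stage synchronous pipeline in CUDA C++, with double-buffered pinned CPU buffers and GPU buffers and one stream per GPU engine; the LayerInfo table and layer_kernel stub are hypothetical, and this single-host-thread sketch is not our actual implementation. The GPU work of a cycle is issued first so that the blocking read of the next layer overlaps with the in-flight copy and kernel.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct LayerInfo { long offset; size_t bytes; };        // hypothetical metadata
__global__ void layer_kernel(const float* w, size_t n) { /* stand-in layer */ }

void infer_sync_pipeline(FILE* fp, const std::vector<LayerInfo>& L,
                         size_t max_layer_bytes) {
  // Double-buffered inter-stage buffers: two pinned CPU buffers between the
  // read and copy stages, two GPU buffers between the copy and kernel stages.
  float *h[2], *d[2];
  for (int b = 0; b < 2; ++b) {
    cudaMallocHost((void**)&h[b], max_layer_bytes);
    cudaMalloc((void**)&d[b], max_layer_bytes);
  }
  cudaStream_t cs, es;                // copy-engine / execution-engine streams
  cudaStreamCreate(&cs);
  cudaStreamCreate(&es);

  int n = (int)L.size();
  for (int i = 0; i < n + 2; ++i) {   // n cycles plus 2 drain cycles
    // Copy stage: transfer layer i-1, which was read in the previous cycle.
    if (i >= 1 && i - 1 < n)
      cudaMemcpyAsync(d[(i - 1) % 2], h[(i - 1) % 2], L[i - 1].bytes,
                      cudaMemcpyHostToDevice, cs);
    // Kernel stage: execute layer i-2, which was copied in the previous cycle.
    if (i >= 2)
      layer_kernel<<<256, 256, 0, es>>>(d[(i - 2) % 2],
                                        L[i - 2].bytes / sizeof(float));
    // Read stage: fetch layer i from the SSD while the GPU engines are busy.
    if (i < n) {
      std::fseek(fp, L[i].offset, SEEK_SET);
      std::fread(h[i % 2], 1, L[i].bytes, fp);
    }
    // Common pipeline cycle boundary: all three stages finish together.
    cudaStreamSynchronize(cs);
    cudaStreamSynchronize(es);
  }
  for (int b = 0; b < 2; ++b) { cudaFreeHost(h[b]); cudaFree(d[b]); }
  cudaStreamDestroy(cs);
  cudaStreamDestroy(es);
}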
Asynchronous pipeline. If a read operation happens to be
the longest in a synchronous pipeline cycle, it causes a GPU
idling interval, negatively impacting the delay. To minimize
such unwanted delays, our architecture is modified to an
asynchronous pipeline, where pipeline stages advance at their
own paces [21]. Between the pipeline stages, we introduce
two circular buffers, each just large enough to hold the largest layer, instead of the two pairs of double buffers used in the
synchronous architecture, cutting the memory requirement in
half. Our implementation provides a 92.7% memory reduction
(to 36.0 MB) with 12.7% delay overhead (to 181.2 ms).
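Conceptually, each circular buffer only needs to track how many bytes are in flight between two stages and block the upstream stage when the downstream one falls behind. The sketch below shows such a byte-granularity flow-control helper; it is an illustrative sketch with hypothetical names, and it omits the companion queue of per-layer (offset, size) records that signals the downstream stage, as well as the splitting of a request at the wrap-around point.

#include <cstddef>
#include <mutex>
#include <condition_variable>

// One instance (backed by a pinned CPU buffer) sits between the read and copy
// stages, another (backed by a GPU buffer) between the copy and kernel stages.
// The capacity must be at least the size of the largest layer.
class ByteRing {
 public:
  explicit ByteRing(size_t capacity) : cap_(capacity) {}

  // Producer: block until 'n' bytes are free, then claim them.
  // Returns the starting offset inside the backing buffer.
  size_t acquire(size_t n) {
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [&] { return cap_ - used_ >= n; });
    size_t off = head_;
    head_ = (head_ + n) % cap_;
    used_ += n;
    return off;
  }

  // Consumer: release 'n' bytes once the downstream stage is done with them,
  // e.g., after the copy (or kernel) for that layer has completed.
  void release(size_t n) {
    {
      std::lock_guard<std::mutex> lk(m_);
      used_ -= n;
      tail_ = (tail_ + n) % cap_;
    }
    not_full_.notify_one();
  }

 private:
  size_t cap_, head_ = 0, tail_ = 0, used_ = 0;
  std::mutex m_;
  std::condition_variable not_full_;
};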
Two-stage pipeline. Recent iGPU-based systems-on-chip (SoCs) (e.g., Nvidia Xavier) provide a special memory management scheme so that a memory buffer can be accessed from both CPU and GPU [16]. This zero-copy memory eliminates the need for copy operations, enabling a 2-stage pipeline. With
this architecture, the memory requirement is further reduced to
just the order of a single layer. Our implementation provides
a 96.3% memory reduction (to 18.0 MB) with 21.5% delay
overhead (to 195.3 ms).
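A minimal sketch of the zero-copy allocation behind the 2-stage pipeline is shown below, assuming a stub layer_kernel and an already-open model file; the real read stage and layer execution are of course more involved. The key point is that cudaHostAlloc() with the mapped flag yields one buffer in shared DRAM with both a CPU pointer and a GPU alias, so the copy stage disappears.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void layer_kernel(const float* w, size_t n) { /* stand-in layer */ }

void run_layer_zero_copy(FILE* fp, long offset, size_t bytes) {
  // On older CUDA versions this must precede context creation; recent
  // versions map pinned host allocations by default.
  cudaSetDeviceFlags(cudaDeviceMapHost);

  float* h_buf = nullptr;                       // CPU-visible pointer
  float* d_buf = nullptr;                       // GPU-visible alias, same DRAM
  cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);

  // Read stage writes the shared buffer directly.
  std::fseek(fp, offset, SEEK_SET);
  std::fread(h_buf, 1, bytes, fp);

  // Kernel stage reads the very same physical memory; no cudaMemcpy needed.
  layer_kernel<<<256, 256>>>(d_buf, bytes / sizeof(float));
  cudaDeviceSynchronize();

  cudaFreeHost(h_buf);
}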
Memory-delay tradeoff. In the asynchronous pipeline ar-
chitectures, we can intentionally increase the circular buffer
size to exploit the tradeoff relation between memory and
delay. Thus, we can devise an iterative optimization process
by gradually increasing the buffer size until there is no further
delay reduction. By this optimization method, we can find the
minimal delay configuration. As a result, near-zero (<1.0 ms)
delay overhead is achieved by a slight increase in memory
usage (from 18.0 MB to 52.8 MB).
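The optimization itself can be as simple as the sketch below, where measure_avg_delay_ms() is a hypothetical profiling hook (e.g., averaging several inference runs at a given circular-buffer capacity) rather than part of our implementation.

#include <cstdio>

// Hypothetical hook: run inference with the given circular-buffer capacity
// and return the average end-to-end delay in milliseconds.
double measure_avg_delay_ms(size_t buffer_bytes);

size_t find_min_delay_buffer(size_t largest_layer_bytes, size_t step_bytes,
                             double min_gain_ms) {
  size_t cap = largest_layer_bytes;               // minimum feasible size
  double best = measure_avg_delay_ms(cap);
  for (;;) {
    double next = measure_avg_delay_ms(cap + step_bytes);
    if (best - next < min_gain_ms) break;         // no meaningful gain left
    cap += step_bytes;
    best = next;
  }
  std::printf("chosen buffer: %zu bytes, delay: %.1f ms\n", cap, best);
  return cap;
}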
The contributions of this study can be summarized as:
• We propose Demand Layering for minimized memory usage in DNN inference systems by loading and executing layers in a layer-by-layer manner.
• Three pipeline architectures are presented that minimize the extra delay overhead of Demand Layering.
• The pipeline architectures are implemented and evaluated on Nvidia Jetson AGX Xavier, showing significant memory reductions with near-zero delay overhead.
II. PRELIMINARIES
A. Deep Neural Networks (DNNs)
In contrast to conventional programs, which are sequences
of instructions, DNNs are sequences of parameters, organized
by layers such as convolutional and fully connected layers.
The parameters are produced in a training phase and stored
in a DNN model file, whose file format depends on the DNN
framework of your choice. For example, Darknet [22] uses
.weights binary files. PyTorch [23] uses .pt or .pth files,
which are binary files serialized by the Python pickle module. TensorFlow [24] uses .pb files, which are binary files in the ProtoBuf format.
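As an illustrative sketch, the following snippet reads such a flat binary model file layer by layer in execution order; the header size and per-layer parameter counts are assumed to be known (they are framework-specific), and the snippet is not tied to any particular format.

#include <cstdio>
#include <vector>

struct LayerParams { std::vector<float> w; };

std::vector<LayerParams> read_model_sequential(
    const char* path, const std::vector<size_t>& floats_per_layer,
    long header_bytes) {
  std::vector<LayerParams> layers;
  FILE* fp = std::fopen(path, "rb");
  if (!fp) return layers;
  std::fseek(fp, header_bytes, SEEK_SET);       // skip framework-specific header
  for (size_t n : floats_per_layer) {           // strictly sequential reads
    LayerParams l;
    l.w.resize(n);
    if (std::fread(l.w.data(), sizeof(float), n, fp) != n) break;
    layers.push_back(std::move(l));
  }
  std::fclose(fp);
  return layers;
}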
Regardless of the file format, the model files must be
loaded to GPU memory in the initialization phase. Then, in
the inference phase, the preloaded parameters are interpreted
and executed by a DNN inference framework in a layer-by-
layer manner [9], [10]. This preloading architecture inherently
imposes a significant GPU memory burden for storing the
entire model parameters, which is especially serious in multi-DNN systems.
B. Integrated CPU-GPU Systems
When designing embedded systems for DNN applications,
iGPUs are highly preferred to dGPUs due to their advantages in size, weight, and power (SWaP) [16]. In
contrast to dGPUs, iGPUs share the same physical memory
space with CPU. In such systems, optimizing GPU memory at the expense of CPU memory provides no net benefit.
Instead, a holistic CPU-GPU memory optimization method is
required.
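Whether a platform actually falls into this category can be checked at run time with standard CUDA device attributes, as in the small sketch below.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int integrated = 0, can_map = 0;
  // Integrated GPUs share DRAM with the CPU; mapped host memory is the
  // property the zero-copy design in Section I relies on.
  cudaDeviceGetAttribute(&integrated, cudaDevAttrIntegrated, 0);
  cudaDeviceGetAttribute(&can_map, cudaDevAttrCanMapHostMemory, 0);
  std::printf("integrated GPU: %d, can map host memory: %d\n",
              integrated, can_map);
  return 0;
}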
A typical example of integrated CPU-GPU systems is
Nvidia Jetson AGX Xavier, which is our experimental plat-
form. Fig. 2 shows its internal architecture with 16 GB shared
DRAM, an 8-core 64-bit ARM CPU, and a 512-core integrated Volta GPU connected through a system bus. Additionally, it is equipped with an M.2 NVMe interface through a PCI express (PCIe) bus that can host an optional SSD besides its built-in 32 GB eMMC storage.
Fig. 2: Nvidia Jetson AGX Xavier hardware architecture.
Fig. 3: Data flow of model parameters.
C. Solid-State Drives (SSDs)
For many years, eMMC storage has dominated embedded systems because conventional embedded applications required neither TB-scale capacity nor GB/s-level band-
width. However, in recent data-intensive applications like
autonomous driving, a vast amount of data should be stored
and retrieved in real time, requiring huge storage capacities
and high bandwidth. Since neither of them can be achieved by
eMMC devices, a viable alternative is to employ SSDs in such
data-centric embedded systems. Recent commercial off-the-
shelf (COTS) SSDs can satisfy such excessive requirements
with their ever-growing capacity and bandwidth.
Our experimental platform is also equipped with a Samsung
980 PRO NVMe M.2 SSD with 1 TB capacity and its officially
announced 7000 MB/s sequential read performance. The SSD
is connected to both CPU and GPU via a PCIe Gen4 interface.
In our target application (i.e., DNN inference), the SSD is used
to store DNN model files, which often reach hundreds of
megabytes. Furthermore, in multi-DNN systems, the storage
requirement is far more significant, making SSDs an ideal
choice for storing DNN model files.
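The sequential read bandwidth that a given storage device actually delivers for a model file can be estimated with a simple sketch like the one below; note that the OS page cache can inflate the result on repeated runs, so dropping caches (or using O_DIRECT) gives a cleaner number. The block size is an arbitrary choice.

#include <chrono>
#include <cstdio>
#include <vector>

double sequential_read_mbps(const char* path, size_t block_bytes = 1 << 20) {
  std::vector<char> block(block_bytes);
  FILE* fp = std::fopen(path, "rb");
  if (!fp) return 0.0;
  size_t total = 0, n = 0;
  auto t0 = std::chrono::steady_clock::now();
  // Stream the whole file sequentially, block by block.
  while ((n = std::fread(block.data(), 1, block.size(), fp)) > 0) total += n;
  auto t1 = std::chrono::steady_clock::now();
  std::fclose(fp);
  double seconds = std::chrono::duration<double>(t1 - t0).count();
  return (total / (1024.0 * 1024.0)) / seconds;
}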
D. Data Flow of DNN Model Parameters
To begin an inference (i.e., forward propagation) on a DNN
model, all the model parameters must reside in GPU memory such
that GPU kernels can directly access them. For that, a three-
step approach is usually used, which is depicted in Fig. 3.
The model file is first read from disk to a CPU memory
buffer (step 1). When allocating CPU buffers, there are several
choices provided by the Nvidia CUDA runtime, which will
be detailed in Section IV-B. Then the parameters are copied
to a GPU memory buffer (step 2). When the source CPU buffer
happens to be a pageable memory by the usual malloc()
function that is not under the control of the CUDA runtime,
the copy is done via a hidden staging area, incurring possible
blockings and delays in case of a staging area shortage. After
the copy operation, GPU kernels can access and execute the
DNN layers in the GPU memory buffer (step 3). As explained
in Section II-B, CPU and GPU memory buffers are from
the same shared DRAM space. Thus, both buffers should be
accounted for when estimating the memory usage of a DNN
inference system. The read operation is processed by CPU,
while the copy and kernel operations are executed by GPU.
Since GPUs have two separate processing units for them (i.e.,
the copy engine and the execution engine), copy and kernel operations can run simultaneously [25]. As a result, read, copy, and kernel
operations can run fully in parallel, providing a great chance
for optimizing the DNN execution architecture.
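The sketch below contrasts the two host-buffer choices for step 1; cudaMallocHost() is only one of the options detailed in Section IV-B, and the buffers and sizes are placeholders. With a pageable malloc() buffer, the asynchronous copy is staged through a hidden pinned area and may block the calling thread, whereas a pinned buffer is transferred directly by the copy engine, leaving the CPU free to start the next read.

#include <cstdlib>
#include <cuda_runtime.h>

void copy_examples(float* d_buf, size_t bytes, cudaStream_t stream) {
  // (a) Pageable host memory: works, but goes through the staging area.
  float* pageable = (float*)std::malloc(bytes);
  cudaMemcpyAsync(d_buf, pageable, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  std::free(pageable);

  // (b) Pinned host memory: the copy engine transfers it directly, and the
  // call returns immediately so copy and kernel work can overlap.
  float* pinned = nullptr;
  cudaMallocHost((void**)&pinned, bytes);
  cudaMemcpyAsync(d_buf, pinned, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  cudaFreeHost(pinned);
}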
E. Observations and Our Motivation
Meanwhile, we made the following observations while investigating various DNN inference frameworks in GPU-based embedded systems:
(i) Memory burden in DNN inference. To the best of
our knowledge, most DNN inference frameworks preload the
whole model parameters from disk to memory in the initial-
ization phase to avoid disk operations in the inference phase.
However, this preloading architecture permanently occupies
a significant amount of system memory for storing model
parameters, which is not acceptable in resource-constrained
embedded systems.
(ii) Layer-by-layer DNN execution. DNN models have a
layered structure, where there are strict data dependencies
between layers. Model files are also organized following the
layered architecture. Most notably, when a certain layer is
executing, only that layer’s parameters are accessed, and the
rest of the parameters are irrelevant to the current layer
execution.
(iii) High-performance SSDs. Most recent SSDs are fast
enough, reaching the speed of 7000 MB/s for sequential
reads. Certainly, the speed of random reads is far slower than
sequential reads. Fortunately, however, reading a model file requires only sequential reads, which can extract the peak performance of SSDs.
Motivation. With the above observations, our intuition is
that the memory usage can be drastically reduced by loading
and unloading model parameters in the granularity of layers in the inference phase, without preloading them in the initialization phase.