Demand Layering for Real-Time DNN Inference
with Minimized Memory Usage
Mingoo Ji1,2, Saehanseul Yi3, Changjin Koo1, Sol Ahn1, Dongjoo Seo3, Nikil Dutt3, and Jong-Chan Kim1,4
1Graduate School of Automotive Engineering, Kookmin University, Korea
2Automotive Electronics Advanced Development TFT, Hyundai Mobis, Korea
3Department of Computer Science, University of California, Irvine, USA
4Department of Automobile and IT Convergence, Kookmin University, Korea
Correspondence: jongchank@kookmin.ac.kr
Abstract—When executing a deep neural network (DNN), its
model parameters are loaded into GPU memory before execution,
incurring a significant GPU memory burden. There are studies
that reduce GPU memory usage by exploiting CPU memory as
a swap device. However, this approach is not applicable in most
embedded systems with integrated GPUs where CPU and GPU
share a common memory. In this regard, we present Demand
Layering, which employs a fast solid-state drive (SSD) as a
co-running partner of a GPU and exploits the layer-by-layer
execution of DNNs. In our approach, a DNN is loaded and
executed in a layer-by-layer manner, minimizing the memory
usage to the order of a single layer. Also, we developed a
pipeline architecture that hides most additional delays caused by
the interleaved parameter loadings alongside layer executions.
Our implementation shows a 96.5% memory reduction with
just 14.8% delay overhead on average for representative DNNs.
Furthermore, by exploiting the memory-delay tradeoff, near-zero
delay overhead (under 1 ms) can be achieved with a slightly
increased memory usage (still an 88.4% reduction), showing the
great potential of Demand Layering.
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
I. INTRODUCTION
To enable efficient deep neural network (DNN) inference
with low-cost embedded hardware, its memory requirement
should be minimized. For that, a typical approach is to
apply pruning and quantization [1]–[3] that reduce the number
of model parameters, however, at the cost of unavoidable
accuracy loss. Once a model is fixed, all the parameters are
loaded into system memory before execution. To the best
of our knowledge, most state-of-the-art DNN frameworks
employ this method despite its excessive memory usage. How-
ever, in the era of large-scale models [4]–[8] and concurrent
DNNs [9]–[11], we argue that this naïve approach is no longer
viable, and thus a new system approach is needed that can
alleviate this excessive memory requirement.
Recent studies try to reduce the memory usage of DNN
inference by efficiently managing activation buffers between
DNN layers [12]–[14]. However, they are not applicable to
storing model parameters. Besides, SwapAdvisor [15] provides
a general method by utilizing inexpensive CPU memory
as a swap device of scarce GPU memory. This method is
promising in discrete GPU (dGPU) systems with separate
CPU and GPU memory. However, most embedded systems
use integrated GPUs (iGPUs), where CPU and GPU share
Fig. 1: Preloading vs. Demand Layering.
a common memory system [16]. In such systems, reducing
GPU memory at the cost of increased CPU memory does not
provide any benefit.
With this motivation, this study aims to reduce the mem-
ory usage of iGPU-based DNN inference systems, explicitly
targeting the memory for model parameters. Our idea is
to borrow the concept of demand paging in conventional
operating systems, where program instructions are loaded to
CPU memory on demand in the granularity of pages (typically
sized 4 KB - 16 KB). Similarly, exploiting the layer-by-layer
execution of DNNs, we propose Demand Layering that loads
model parameters on demand in the granularity of layers while
dropping previous layers of no use. In this manner, the memory
requirement is significantly reduced to the order of a single
layer from the order of the entire model. Fig. 1 highlights
the difference between the preloading architecture and our
Demand Layering.
However, the memory reduction is not free. It comes at
the cost of increased delays. Thus, we conducted a thorough
delay analysis, which shows that the inference delay can be decomposed into the following three operations:
• Read: Model parameters are read into CPU memory.
• Copy: Model parameters are copied to GPU memory.
• Kernel: DNN layers are executed by GPU kernels.
In the preloading architecture, all the read and copy oper-
ations are only in the initialization phase; thus, its inference
delay is just the sum of GPU kernel executions. In contrast,
Demand Layering repeatedly conducts read and copy oper-
ations during the inference phase, which potentially causes
extra delays. For that, our baseline approach is to employ
a high-performance solid-state drive (SSD). Compared with
eMMC storage, which typically offers around 300 MB/s of sequential read
performance, M.2 NVMe SSDs provide up to 7000 MB/s of
sequential read performance [17]. Although random reads are
somewhat slower, most DNN model files exhibit sequential
access patterns due to the inherent sequential nature of DNN
executions [10], [11].
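For illustration, the following CUDA C++ sketch shows an unpipelined form of this loop: each layer's parameters are read from the model file, copied to GPU memory, and executed before the next layer is touched, so memory usage stays in the order of a single layer. The LayerInfo table and the layer_kernel stub are hypothetical placeholders, not part of our actual implementation.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-layer metadata: byte offset and size of the layer's
// parameters within the model file.
struct LayerInfo { long offset; size_t bytes; };

// Stand-in for a real layer kernel (convolution, fully connected, ...).
__global__ void layer_kernel(const float* weights, size_t n) { /* ... */ }

void infer_unpipelined(const char* model_path,
                       const std::vector<LayerInfo>& layers,
                       size_t max_layer_bytes) {
  FILE* fp = std::fopen(model_path, "rb");
  float* h_buf = nullptr;                            // CPU-side staging buffer
  float* d_buf = nullptr;                            // GPU-side parameter buffer
  cudaMallocHost((void**)&h_buf, max_layer_bytes);   // pinned host memory
  cudaMalloc((void**)&d_buf, max_layer_bytes);

  for (const LayerInfo& l : layers) {
    // Read: sequential read of this layer's parameters from the SSD.
    std::fseek(fp, l.offset, SEEK_SET);
    std::fread(h_buf, 1, l.bytes, fp);
    // Copy: move the parameters into the GPU buffer.
    cudaMemcpy(d_buf, h_buf, l.bytes, cudaMemcpyHostToDevice);
    // Kernel: execute the layer; the previous layer's parameters have
    // already been overwritten, so only one layer is ever resident.
    layer_kernel<<<256, 256>>>(d_buf, l.bytes / sizeof(float));
    cudaDeviceSynchronize();
  }
  std::fclose(fp);
  cudaFreeHost(h_buf);
  cudaFree(d_buf);
}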
Even with the fastest SSD, extra delays are still significant.
Thus, our next approach is to hide away the delays as much
as possible by pipelined execution of read, copy, and kernel
operations. Fortunately, even in iGPU systems, these three
operations can run in parallel, because read operations can
be carried out by CPU while copy and kernel operations are
being processed by GPU. Even better, Nvidia GPUs have two
separate processing units: a copy engine (CE) and an execution
engine (EE). The CE can process copy operations while the EE
is executing GPU kernels [18], [19]. As a result, read, copy,
and kernel operations can run fully in parallel. Based on this
parallel hardware architecture, we developed and evaluated a
number of software pipeline architectures on an Nvidia Jetson
AGX Xavier platform with various DNNs. The remainder
of this section introduces the case with YOLOv4 [20] in
particular, whose model size is 245.8 MB and whose average
inference delay in the preloading architecture is 160.8 ms.
Besides, the largest layer size is 18.0 MB.
Synchronous pipeline. In the 3-stage synchronous pipeline
architecture, its read, copy, and kernel stages advance while
synchronized with a common pipeline cycle. Since kernel
operations are usually the longest among the three stages, most
read and copy operations are hidden behind kernel operations.
This pipeline architecture needs two inter-stage buffers: (i) a
CPU memory buffer between read and copy stages and (ii) a
GPU memory buffer between copy and kernel stages. Since
each buffer needs to hold just the layer being processed, the
required buffer size is the size of the largest layer. In addition,
the buffers should be double-buffered because, for example, a
read to the CPU buffer can happen simultaneously with a copy
from the same buffer. The same applies to the GPU buffer.
Our implementation provides an 85.4% memory reduction (to
72.0 MB) with 23.7% delay overhead (to 198.9 ms).
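For illustration, the sketch below organizes one possible 3-stage synchronous pipeline in CUDA C++, with double-buffered pinned CPU buffers and GPU buffers and one stream per GPU engine; the LayerInfo table and layer_kernel stub are hypothetical, and this single-host-thread sketch is not our actual implementation. The GPU work of a cycle is issued first so that the blocking read of the next layer overlaps with the in-flight copy and kernel.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct LayerInfo { long offset; size_t bytes; };        // hypothetical metadata
__global__ void layer_kernel(const float* w, size_t n) { /* stand-in layer */ }

void infer_sync_pipeline(FILE* fp, const std::vector<LayerInfo>& L,
                         size_t max_layer_bytes) {
  // Double-buffered inter-stage buffers: two pinned CPU buffers between the
  // read and copy stages, two GPU buffers between the copy and kernel stages.
  float *h[2], *d[2];
  for (int b = 0; b < 2; ++b) {
    cudaMallocHost((void**)&h[b], max_layer_bytes);
    cudaMalloc((void**)&d[b], max_layer_bytes);
  }
  cudaStream_t cs, es;                // copy-engine / execution-engine streams
  cudaStreamCreate(&cs);
  cudaStreamCreate(&es);

  int n = (int)L.size();
  for (int i = 0; i < n + 2; ++i) {   // n cycles plus 2 drain cycles
    // Copy stage: transfer layer i-1, which was read in the previous cycle.
    if (i >= 1 && i - 1 < n)
      cudaMemcpyAsync(d[(i - 1) % 2], h[(i - 1) % 2], L[i - 1].bytes,
                      cudaMemcpyHostToDevice, cs);
    // Kernel stage: execute layer i-2, which was copied in the previous cycle.
    if (i >= 2)
      layer_kernel<<<256, 256, 0, es>>>(d[(i - 2) % 2],
                                        L[i - 2].bytes / sizeof(float));
    // Read stage: fetch layer i from the SSD while the GPU engines are busy.
    if (i < n) {
      std::fseek(fp, L[i].offset, SEEK_SET);
      std::fread(h[i % 2], 1, L[i].bytes, fp);
    }
    // Common pipeline cycle boundary: all three stages finish together.
    cudaStreamSynchronize(cs);
    cudaStreamSynchronize(es);
  }
  for (int b = 0; b < 2; ++b) { cudaFreeHost(h[b]); cudaFree(d[b]); }
  cudaStreamDestroy(cs);
  cudaStreamDestroy(es);
}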
Asynchronous pipeline. If a read operation happens to be
the longest in a synchronous pipeline cycle, it causes a GPU
idling interval, negatively impacting the delay. To minimize
such unwanted delays, our architecture is modified to an
asynchronous pipeline, where pipeline stages advance at their
own paces [21]. Between the pipeline stages, we introduce
two circular buffers, each just large enough to hold the largest layer, instead of the two pairs of double buffers used in the
synchronous architecture, cutting the memory requirement in
half. Our implementation provides a 92.7% memory reduction
(to 36.0 MB) with 12.7% delay overhead (to 181.2 ms).
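Conceptually, each circular buffer only needs to track how many bytes are in flight between two stages and block the upstream stage when the downstream one falls behind. The sketch below shows such a byte-granularity flow-control helper; it is an illustrative sketch with hypothetical names, and it omits the companion queue of per-layer (offset, size) records that signals the downstream stage, as well as the splitting of a request at the wrap-around point.

#include <cstddef>
#include <mutex>
#include <condition_variable>

// One instance (backed by a pinned CPU buffer) sits between the read and copy
// stages, another (backed by a GPU buffer) between the copy and kernel stages.
// The capacity must be at least the size of the largest layer.
class ByteRing {
 public:
  explicit ByteRing(size_t capacity) : cap_(capacity) {}

  // Producer: block until 'n' bytes are free, then claim them.
  // Returns the starting offset inside the backing buffer.
  size_t acquire(size_t n) {
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [&] { return cap_ - used_ >= n; });
    size_t off = head_;
    head_ = (head_ + n) % cap_;
    used_ += n;
    return off;
  }

  // Consumer: release 'n' bytes once the downstream stage is done with them,
  // e.g., after the copy (or kernel) for that layer has completed.
  void release(size_t n) {
    {
      std::lock_guard<std::mutex> lk(m_);
      used_ -= n;
      tail_ = (tail_ + n) % cap_;
    }
    not_full_.notify_one();
  }

 private:
  size_t cap_, head_ = 0, tail_ = 0, used_ = 0;
  std::mutex m_;
  std::condition_variable not_full_;
};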
Two-stage pipeline. Recent iGPU-based systems-on-chip (SoCs) (e.g., Nvidia Xavier) provide a special memory management scheme so that a memory buffer can be accessed from both CPU and GPU [16]. This zero-copy memory eliminates the need for copy operations, enabling a 2-stage pipeline. With
this architecture, the memory requirement is further reduced to
just the order of a single layer. Our implementation provides
a 96.3% memory reduction (to 18.0 MB) with 21.5% delay
overhead (to 195.3 ms).
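A minimal sketch of the zero-copy allocation behind the 2-stage pipeline is shown below, assuming a stub layer_kernel and an already-open model file; the real read stage and layer execution are of course more involved. The key point is that cudaHostAlloc() with the mapped flag yields one buffer in shared DRAM with both a CPU pointer and a GPU alias, so the copy stage disappears.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void layer_kernel(const float* w, size_t n) { /* stand-in layer */ }

void run_layer_zero_copy(FILE* fp, long offset, size_t bytes) {
  // On older CUDA versions this must precede context creation; recent
  // versions map pinned host allocations by default.
  cudaSetDeviceFlags(cudaDeviceMapHost);

  float* h_buf = nullptr;                       // CPU-visible pointer
  float* d_buf = nullptr;                       // GPU-visible alias, same DRAM
  cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);

  // Read stage writes the shared buffer directly.
  std::fseek(fp, offset, SEEK_SET);
  std::fread(h_buf, 1, bytes, fp);

  // Kernel stage reads the very same physical memory; no cudaMemcpy needed.
  layer_kernel<<<256, 256>>>(d_buf, bytes / sizeof(float));
  cudaDeviceSynchronize();

  cudaFreeHost(h_buf);
}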
Memory-delay tradeoff. In the asynchronous pipeline ar-
chitectures, we can intentionally increase the circular buffer
size to exploit the tradeoff relation between memory and
delay. Thus, we can devise an iterative optimization process
by gradually increasing the buffer size until there is no further
delay reduction. By this optimization method, we can find the
minimal delay configuration. As a result, near-zero (<1.0 ms)
delay overhead is achieved by a slight increase in memory
usage (from 18.0 MB to 52.8 MB).
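The optimization itself can be as simple as the sketch below, where measure_avg_delay_ms() is a hypothetical profiling hook (e.g., averaging several inference runs at a given circular-buffer capacity) rather than part of our implementation.

#include <cstdio>

// Hypothetical hook: run inference with the given circular-buffer capacity
// and return the average end-to-end delay in milliseconds.
double measure_avg_delay_ms(size_t buffer_bytes);

size_t find_min_delay_buffer(size_t largest_layer_bytes, size_t step_bytes,
                             double min_gain_ms) {
  size_t cap = largest_layer_bytes;               // minimum feasible size
  double best = measure_avg_delay_ms(cap);
  for (;;) {
    double next = measure_avg_delay_ms(cap + step_bytes);
    if (best - next < min_gain_ms) break;         // no meaningful gain left
    cap += step_bytes;
    best = next;
  }
  std::printf("chosen buffer: %zu bytes, delay: %.1f ms\n", cap, best);
  return cap;
}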
The contributions of this study can be summarized as:
• We propose Demand Layering for minimized memory usage in DNN inference systems by loading and executing layers in a layer-by-layer manner.
• Three pipeline architectures are presented that minimize the extra delay overhead of Demand Layering.
• The pipeline architectures are implemented and evaluated on Nvidia Jetson AGX Xavier, showing significant memory reductions with near-zero delay overhead.
II. PRELIMINARIES
A. Deep Neural Networks (DNNs)
In contrast to conventional programs, which are sequences
of instructions, DNNs are sequences of parameters, organized
by layers such as convolutional and fully connected layers.
The parameters are produced in a training phase and stored
in a DNN model file, whose file format depends on the DNN
framework of your choice. For example, Darknet [22] uses
.weights binary files. PyTorch [23] uses .pt or .pth files,
which are binary files serialized by the Python pickle module. TensorFlow [24] uses .pb files, which are binary files in the ProtoBuf format.
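As an illustrative sketch, the following snippet reads such a flat binary model file layer by layer in execution order; the header size and per-layer parameter counts are assumed to be known (they are framework-specific), and the snippet is not tied to any particular format.

#include <cstdio>
#include <vector>

struct LayerParams { std::vector<float> w; };

std::vector<LayerParams> read_model_sequential(
    const char* path, const std::vector<size_t>& floats_per_layer,
    long header_bytes) {
  std::vector<LayerParams> layers;
  FILE* fp = std::fopen(path, "rb");
  if (!fp) return layers;
  std::fseek(fp, header_bytes, SEEK_SET);       // skip framework-specific header
  for (size_t n : floats_per_layer) {           // strictly sequential reads
    LayerParams l;
    l.w.resize(n);
    if (std::fread(l.w.data(), sizeof(float), n, fp) != n) break;
    layers.push_back(std::move(l));
  }
  std::fclose(fp);
  return layers;
}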
Regardless of the file format, the model files must be
loaded to GPU memory in the initialization phase. Then, in
the inference phase, the preloaded parameters are interpreted
and executed by a DNN inference framework in a layer-by-
layer manner [9], [10]. This preloading architecture inherently
imposes a significant GPU memory burden for storing the
entire model parameters, which is especially serious in multi-DNN systems.
B. Integrated CPU-GPU Systems
When designing embedded systems for DNN applications,
iGPUs are highly preferred to dGPUs due to their advantages in size, weight, and power (SWaP) [16]. In
contrast to dGPUs, iGPUs share the same physical memory
space with CPU. In such systems, optimizing GPU memory at the expense of CPU memory provides no net benefit.
Instead, a holistic CPU-GPU memory optimization method is
required.
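Whether a platform actually falls into this category can be checked at run time with standard CUDA device attributes, as in the small sketch below.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int integrated = 0, can_map = 0;
  // Integrated GPUs share DRAM with the CPU; mapped host memory is the
  // property the zero-copy design in Section I relies on.
  cudaDeviceGetAttribute(&integrated, cudaDevAttrIntegrated, 0);
  cudaDeviceGetAttribute(&can_map, cudaDevAttrCanMapHostMemory, 0);
  std::printf("integrated GPU: %d, can map host memory: %d\n",
              integrated, can_map);
  return 0;
}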
A typical example of integrated CPU-GPU systems is
Nvidia Jetson AGX Xavier, which is our experimental plat-
form. Fig. 2 shows its internal architecture with 16 GB shared
DRAM, an 8-core 64-bit ARM CPU, and a 512-core integrated Volta GPU connected through a system bus. Additionally, it is equipped with an M.2 NVMe interface through a PCI express (PCIe) bus that can host an optional SSD besides its built-in 32 GB eMMC storage.
Fig. 2: Nvidia Jetson AGX Xavier hardware architecture.
Fig. 3: Data flow of model parameters.
C. Solid-State Drives (SSDs)
For many years, eMMC storage has dominated embedded systems because conventional embedded applications required neither TB-scale capacity nor GB/s-level band-
width. However, in recent data-intensive applications like
autonomous driving, a vast amount of data should be stored
and retrieved in real time, requiring huge storage capacities
and high bandwidth. Since neither of them can be achieved by
eMMC devices, a viable alternative is to employ SSDs in such
data-centric embedded systems. Recent commercial off-the-
shelf (COTS) SSDs can satisfy such excessive requirements
with their ever-growing capacity and bandwidth.
Our experimental platform is also equipped with a Samsung
980 PRO NVMe M.2 SSD with 1 TB capacity and its officially
announced 7000 MB/s sequential read performance. The SSD
is connected to both CPU and GPU via a PCIe Gen4 interface.
In our target application (i.e., DNN inference), the SSD is used
to store DNN model files, which often reach hundreds of
megabytes. Furthermore, in multi-DNN systems, the storage
requirement is far more significant, making SSDs an ideal
choice for storing DNN model files.
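The sequential read bandwidth that a given storage device actually delivers for a model file can be estimated with a simple sketch like the one below; note that the OS page cache can inflate the result on repeated runs, so dropping caches (or using O_DIRECT) gives a cleaner number. The block size is an arbitrary choice.

#include <chrono>
#include <cstdio>
#include <vector>

double sequential_read_mbps(const char* path, size_t block_bytes = 1 << 20) {
  std::vector<char> block(block_bytes);
  FILE* fp = std::fopen(path, "rb");
  if (!fp) return 0.0;
  size_t total = 0, n = 0;
  auto t0 = std::chrono::steady_clock::now();
  // Stream the whole file sequentially, block by block.
  while ((n = std::fread(block.data(), 1, block.size(), fp)) > 0) total += n;
  auto t1 = std::chrono::steady_clock::now();
  std::fclose(fp);
  double seconds = std::chrono::duration<double>(t1 - t0).count();
  return (total / (1024.0 * 1024.0)) / seconds;
}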
D. Data Flow of DNN Model Parameters
To begin an inference (i.e., forward propagation) on a DNN
model, all the model parameters must reside in GPU memory such
that GPU kernels can directly access them. For that, a three-
step approach is usually used, which is depicted in Fig. 3.
The model file is first read from disk to a CPU memory
buffer (step 1). When allocating CPU buffers, there are several
choices provided by the Nvidia CUDA runtime, which will
be detailed in Section IV-B. Then the parameters are copied
to a GPU memory buffer (step 2). When the source CPU buffer
happens to be a pageable memory by the usual malloc()
function that is not under the control of the CUDA runtime,
the copy is done via a hidden staging area, incurring possible
blockings and delays in case of a staging area shortage. After
the copy operation, GPU kernels can access and execute the
DNN layers in the GPU memory buffer (step 3). As explained
in Section II-B, CPU and GPU memory buffers are from
the same shared DRAM space. Thus, both buffers should be
accounted for when estimating the memory usage of a DNN
inference system. The read operation is processed by CPU,
while the copy and kernel operations are executed by GPU.
Since GPUs have two separate processing units for them (i.e.,
the copy engine and the execution engine), copy and kernel operations can run simultaneously [25]. As a result, read, copy, and kernel
operations can run fully in parallel, providing a great chance
for optimizing the DNN execution architecture.
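The sketch below contrasts the two host-buffer choices for step 1; cudaMallocHost() is only one of the options detailed in Section IV-B, and the buffers and sizes are placeholders. With a pageable malloc() buffer, the asynchronous copy is staged through a hidden pinned area and may block the calling thread, whereas a pinned buffer is transferred directly by the copy engine, leaving the CPU free to start the next read.

#include <cstdlib>
#include <cuda_runtime.h>

void copy_examples(float* d_buf, size_t bytes, cudaStream_t stream) {
  // (a) Pageable host memory: works, but goes through the staging area.
  float* pageable = (float*)std::malloc(bytes);
  cudaMemcpyAsync(d_buf, pageable, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  std::free(pageable);

  // (b) Pinned host memory: the copy engine transfers it directly, and the
  // call returns immediately so copy and kernel work can overlap.
  float* pinned = nullptr;
  cudaMallocHost((void**)&pinned, bytes);
  cudaMemcpyAsync(d_buf, pinned, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  cudaFreeHost(pinned);
}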
E. Observations and Our Motivation
Meanwhile, we made the following observations while investigating various DNN inference frameworks in GPU-based embedded systems:
(i) Memory burden in DNN inference. To the best of
our knowledge, most DNN inference frameworks preload the
whole model parameters from disk to memory in the initial-
ization phase to avoid disk operations in the inference phase.
However, this preloading architecture permanently occupies
a significant amount of system memory for storing model
parameters, which is not acceptable in resource-constrained
embedded systems.
(ii) Layer-by-layer DNN execution. DNN models have a
layered structure, where there are strict data dependencies
between layers. Model files are also organized following the
layered architecture. Most notably, when a certain layer is
executing, only that layer’s parameters are accessed, and the
rest of the parameters are irrelevant to the current layer
execution.
(iii) High-performance SSDs. Most recent SSDs are fast
enough, reaching the speed of 7000 MB/s for sequential
reads. Certainly, the speed of random reads is far slower than
sequential reads. Fortunately, however, reading a model file requires only sequential reads, which can extract the peak performance of SSDs.
Motivation. With the above observations, our intuition is
that the memory usage can be drastically reduced by loading
and unloading model parameters in the granularity of layers in the inference phase, without preloading them in the initialization phase.