
is possible to merge a sequence of convolutions [13] and a
sequence of dense operations such as “Y=AX+B”.
Fusing removes the need to store some intermediate tensors in slow memory and improves cache utilization. It also removes the synchronization barriers between operations, maximizing the utilization of the cores. Note that these optimizations may also benefit the training phase, as done in XLA [4].
Moreover, some fusing operations amount to mathematical simplification. For example, adjacent convolution and batch-normalization layers may be merged into a single convolution layer with adjusted weights [14].
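As an illustration, a minimal NumPy sketch of this folding could look as follows, assuming a 2D convolution with weights of shape (out_channels, in_channels, kh, kw) and per-channel batch-normalization statistics; the function and variable names are ours, not taken from [14]:

import numpy as np

def fold_batchnorm_into_conv(W, b, mean, var, gamma, beta, eps=1e-5):
    """Fold a BatchNorm layer y = gamma*(x - mean)/sqrt(var + eps) + beta
    into the preceding convolution x = conv(inp, W) + b.

    W: conv weights of shape (out_channels, in_channels, kh, kw)
    b: conv bias of shape (out_channels,)
    mean, var, gamma, beta: per-output-channel BN statistics/parameters.
    Returns the weights and bias of an equivalent single convolution.
    """
    scale = gamma / np.sqrt(var + eps)          # per-channel multiplier
    W_fused = W * scale[:, None, None, None]    # rescale each output filter
    b_fused = (b - mean) * scale + beta         # shift the bias accordingly
    return W_fused, b_fused

At inference time, the fused layer produces the same output as the convolution followed by batch normalization, but with one memory pass and one kernel launch less.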
Other high-level optimizations include constant folding, which pre-computes the parts of the graph whose inputs are all constants, and static memory planning, which pre-allocates the intermediate tensor buffers.
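For instance (a hypothetical illustration, not tied to a specific framework), a node that multiplies two constant tensors can be replaced at optimization time by its pre-computed result:

import numpy as np

# Hypothetical graph fragment: scale and weight are both compile-time constants,
# so their product can be evaluated once during graph optimization ...
scale = np.float32(1.0 / 255.0)
weight = np.random.rand(64, 3, 3, 3).astype(np.float32)

folded_weight = scale * weight  # constant folding: computed by the compiler

# ... and the runtime graph keeps only the single folded constant instead of
# re-evaluating "scale * weight" at every inference call.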
Low-level optimization. The computing graph is converted into low-level code optimized for the target hardware. These optimizations benefit from decades of compilation expertise: common sub-expression elimination, vectorization, loop ordering/tiling/unrolling, threading patterns, memory caching/reuse, etc. TVM [15] and LLVM MLIR [5] are low-level compilers and optimizers enriched with the tensor type.
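As a simple illustration of loop tiling, a matrix multiplication can be reorganized into blocks that stay resident in cache. The sketch below is plain Python/NumPy for readability; compilers such as TVM emit tiled native code rather than interpreted loops:

import numpy as np

def matmul_tiled(A, B, tile=32):
    """Blocked (tiled) matrix multiplication C = A @ B.
    Each (tile x tile) block of A and B is reused while it is hot in cache,
    which is the effect low-level compilers obtain with loop tiling."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C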
A critical part of optimizing the performance of a DNN model is its batch size. It controls core utilization, memory consumption, and the data exchange between the CPU holding the input data and the device running the DNN (if different). The best batch size is hard to predict, so it is common practice to scan a range of candidate values.
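A minimal sketch of such a scan, assuming any batched inference function predict and an arbitrary list of candidate sizes (both of our choosing), could be:

import time
import numpy as np

def scan_batch_sizes(predict, x, candidates=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Measure throughput (samples/second) for each candidate batch size
    and return the best one together with all measurements."""
    results = {}
    for bs in candidates:
        start = time.perf_counter()
        for i in range(0, len(x), bs):
            predict(x[i:i + bs])
        elapsed = time.perf_counter() - start
        results[bs] = len(x) / elapsed
    return max(results, key=results.get), results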
III. INFERENCE FRAMEWORKS UNIFIED API
Different inference frameworks have different APIs, which is why we propose a common Python API here. Its calling overhead is negligible. It contains two functions: "load(path, params)" and "predict(x)→y".
The "load(path, params)" function loads a previously trained model from disk at the location path and applies post-training optimization driven by the parameter dictionary params. The optimized DAG is stored on the target GPU given by params['gpuid'] ('-1' for the CPU).
Since post-training optimization may take minutes, inference frameworks generally allow caching the optimized DAG on disk. One application of this cached optimized DAG is the elastic deployment of prediction as a micro-service.
The prediction function is given by f(x)→y, with x the data samples and y the associated predictions. x is split into batches of size params['batch size']. Batching is done asynchronously in f to overlap the computation of the predictions by the DAG with the movements of x and y.
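For concreteness, a minimal sketch of this unified API, backed here by ONNX Runtime as one possible backend, could look like the following; the class name, default values, and the synchronous batching loop are simplifications of ours, not a fixed specification:

import numpy as np
import onnxruntime as ort  # one possible backend among those benchmarked

class UnifiedModel:
    """Minimal wrapper exposing load(path, params) and predict(x) -> y."""

    def load(self, path, params):
        gpuid = params.get("gpuid", -1)
        self.batch_size = params.get("batch size", 32)
        providers = (
            [("CUDAExecutionProvider", {"device_id": gpuid})]
            if gpuid >= 0 else ["CPUExecutionProvider"]
        )
        self.session = ort.InferenceSession(path, providers=providers)
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, x):
        # Synchronous batching for readability; as described above, the actual
        # implementation overlaps batch transfers with computation asynchronously.
        outputs = []
        for i in range(0, len(x), self.batch_size):
            batch = x[i:i + self.batch_size].astype(np.float32)
            outputs.append(self.session.run(None, {self.input_name: batch})[0])
        return np.concatenate(outputs)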
IV. EXPERIMENTAL SETTINGS
Neural networks. ML scientists have developed a variety of convolutional blocks (e.g., VGG, residual connections). They propose different neural networks (e.g., ResNet50, ResNet101, ...), sometimes with the help of automatic tuning techniques, to optimize the trade-off between accuracy and computing efficiency. We present four of them in Table I.
TABLE I: Four deep neural network architectures. We chose them to be diverse in terms of depth, width (ratio between #params and #layers), and density.
                 #param./#layers   #layers   #jumps   #param.   Jump type
VGG19                 7.26M           19        0       138M    N/A
ResNet50              0.52M           50       16        26M    Additions
DenseNet201           0.1M           201       98        20M    Concatenations
EfficientNetB0        0.06M           89       25       5.3M    Mult. and Add.
GPUs have evolved into computing devices containing thousands of cores and gigabytes of memory, and a modern deep neural network may not fully utilize all the resources of a single GPU. To match the computing needs of the ML community, an ideal inference framework should not only dedicate its resources efficiently to one DNN but should also perform well when running multiple DNNs at the same time. That is why we propose ensembles in this work. All values in parentheses are our measured top-1 test accuracy on ImageNet: VGG19 (69.80%), ResNet50 (74.26%), DenseNet201 (75.11%); ensembling them by interpolating their predictions reaches 76.92%. We chose not to add EfficientNetB0 to the ensemble due to operator errors with some frameworks.
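As an illustration of the ensembling scheme, assuming the interpolation is a simple uniform average of the per-class probabilities (one possible choice among others):

import numpy as np

def ensemble_predict(models, x):
    """Interpolate the predictions of several models by averaging their
    per-class probabilities, then pick the most likely class."""
    probs = [m.predict(x) for m in models]   # each: (n_samples, n_classes)
    mean_probs = np.mean(probs, axis=0)      # uniform interpolation
    return np.argmax(mean_probs, axis=1)     # top-1 predicted class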
Our benchmarks are performed on two machines. We do not distribute inference over multiple GPUs, to keep the analysis focused on the inference framework, even though we would expect performance gains from the map-reduce paradigm.
Machine A is an HGX-1 equipped with Tesla V100 SXM2 GPUs, each containing 5120 CUDA cores running at 1312-1530 MHz with 16 GB of GPU memory. Its CPU is a dual-socket Intel(R) Xeon(R) CPU E5-2698 v4 with a total of 80 cores at 1.2-3.6 GHz, and 512 GB of RAM. The total GPU board power is 300 watts.
Machine B is an in-house machine equipped with NVIDIA Ampere A100 PCI-E GPUs, each containing 6912 CUDA cores running at 765-1410 MHz with 40 GB of GPU memory. Its CPU is a single-socket AMD EPYC 7F52 with 16 cores running at 2.5-3.5 GHz, and 256 GB of RAM. The total GPU board power is 250 watts.
Software stack. All software versions are the same on both machines. The assessed inference framework versions are TensorFlow 2.6, TensorRT 8.0, ONNX Runtime 1.10, and OpenVINO 2021. The neural network converters needed to benchmark these frameworks are tf2onnx 1.9.3 and LLVM 14.0.
Machine A runs Ubuntu with Python 3.9; Machine B runs CentOS with Python 3.8. A specialized real-time OS may improve latency determinism and latency itself, but could also lower throughput. We evaluated MLIR (the onnx-mlir 0.2 framework [5]) on Machine B, but it supports only ResNet50.
In all our benchmarks, TensorFlow is accelerated with XLA [4] (Accelerated Linear Algebra).
Graph optimization settings: