Deep Learning Inference Frameworks Benchmark
Pierrick Pochelu
pierrick.pochelu@gmail.com
Abstract—Deep learning (DL) has been widely adopted in recent years, but it is a computing-intensive method. Therefore, researchers have proposed diverse optimizations to accelerate its predictions for end-user applications. However, no single inference framework currently dominates in terms of performance. This paper takes a holistic approach to conduct an empirical comparison and analysis of four representative DL inference frameworks. First, given a selection of CPU-GPU configurations, we show that for a specific DL framework, different configurations of its settings may have a significant impact on prediction speed, memory, and computing power. Second, to the best of our knowledge, this study is the first to identify the opportunities for accelerating ensembles of co-localized models on the same GPU. This measurement study provides an in-depth empirical comparison and analysis of four representative DL frameworks and offers practical guidance for service providers to deploy and deliver DL predictions.
Index Terms—Deep learning, neural network, inference system,
software optimization
I. INTRODUCTION
A deep learning inference framework predicts with an already-trained neural network. After training, inference frameworks such as TensorRT [1], ONNX Runtime [2], OpenVINO [3], TensorFlow XLA [4], and LLVM MLIR [5] apply diverse optimizations to accelerate the prediction speed.
The last decade has shown that bigger deep learning models are generally more accurate. However, they are also slower and more memory-hungry. Accelerating their predictions is therefore crucial, both for speed-critical applications and for applications where prediction quality is the priority but reasonable speed is still required.
While the training of DNNs is a time-bounded activity, inference is generally deployed and runs for a long time. This is why optimizing the prediction time of deep neural networks is sometimes considered more strategic than training [6].
This paper analyzes the prediction performance of four frameworks: TensorFlow, ONNX Runtime, OpenVINO, and TensorRT, benchmarked on diverse computer vision neural networks. For each framework, we gather more than 80 values, including throughput (predictions per second), load time, memory consumption, and power consumption on both GPU and CPU. We provide a comprehensive review of the optimization techniques used and evaluate them to guide the choice of inference framework. The figures are presented in the appendices and the link to the code is given in the conclusion.
Some benchmarks have been proposed to measure deep neural network speeds. Microbenchmarks such as AI-Matrix [7] aim to measure the speed of operators such as matrix multiplication and convolution, but they are far from evaluating the overall complexity of modern neural networks. DAWNBench [8] shows that neural networks underutilize cores due to the bottleneck of memory transfers. Benchmarks such as MLPerf Inference [9] and ML Bench [10] are comprehensive studies of different inference applications, inference frameworks, and hardware. However, they generally do not provide an in-depth analysis of the inference framework settings.
First, on a selection of two computing nodes, the deep learning models and the inference framework settings are analyzed. We measure that, for a specific neural network, different inference frameworks may have a significant impact on prediction speed, memory, and computing power. Second, to the best of our knowledge, this study is the first to identify the opportunities for accelerating ensembles of co-localized neural networks on the same GPU, an increasingly common deep learning procedure [11] [12]. This measurement study provides an in-depth empirical comparison and analysis of four representative DL frameworks and offers practical guidance for deploying and delivering DL predictions to end-user applications.
Section II presents the different representations and optimization methods of the inference frameworks. Next, Section III describes a common API that makes the many comparisons and analyses easier. Section IV details the settings and enabled optimizations of each framework. In Section V, the results of the experiments are broken down into four categories: computing time, memory consumption, power consumption, and loading time. For the first category, time benchmarks of an ensemble of neural networks predicting together are included. The conclusion in Section VI summarizes the results, provides insights for the future, and gives the link to the code.
II. POST-TRAINING REPRESENTATION AND OPTIMIZATION
Computational graphs provide a global view of operators (nodes) and tensor exchanges (edges) without specifying how each operator must be implemented. Operators are generally implemented in a low-level code representation.
Post-optimization is an intermediate step between training and inference that optimizes the model representation. It generally exploits the high-level representation to change the computational graph, and the code representation to compile and optimize the low-level code targeting the specific hardware.
These high-level and low-level optimizations are generally orthogonal to each other, and they can be used together to obtain further performance gains.
High-level optimization. Fusing consists of merging multiple operations into one single kernel launch. For example, it is possible to merge a sequence of convolutions [13] and a sequence of dense operations such as “Y = AX + B”.
Fusing removes the need for slow storage of some intermediate tensors and improves cache utilization. It also removes the synchronization barriers between operations, maximizing the utilization of the cores. Note that these optimization types may also be beneficial to the training phase, as in XLA [4].
Moreover, some fusing operations consist of mathematical simplifications. For example, adjacent convolution and batch normalization layers may be merged into one single convolution layer with modified weights [14].
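For illustration, the following minimal sketch shows how such a convolution/batch-normalization fusion can be expressed with NumPy. The layout of the weights (output channels first) and the function name are assumptions for the example, not the implementation used by any particular framework.

    import numpy as np

    def fold_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
        """Fold a BatchNorm layer into the preceding convolution.

        W: convolution weights of shape (out_channels, in_channels, kH, kW)
        b: convolution bias of shape (out_channels,)
        gamma, beta, mean, var: BatchNorm parameters of shape (out_channels,)
        Returns (W_fold, b_fold) such that BN(conv(x, W, b)) == conv(x, W_fold, b_fold).
        """
        scale = gamma / np.sqrt(var + eps)          # per-channel rescaling factor
        W_fold = W * scale[:, None, None, None]     # rescale each output filter
        b_fold = (b - mean) * scale + beta          # shift the bias accordingly
        return W_fold, b_fold

After this folding, the batch normalization layer disappears from the graph and only a single convolution remains at inference time.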
Other high-level optimizations include constant folding and static memory planning, which respectively pre-compute constant parts of the graph and pre-allocate the intermediate tensor buffers.
Low-level optimization. The computational graph is converted into low-level code optimized for the targeted hardware. These optimizations benefit from decades of knowledge of compilation techniques, including sub-expression elimination, vectorization, loop ordering/tiling/unrolling, threading patterns, and memory caching/reuse. TVM [15] and LLVM MLIR [5] are low-level compilers and optimizers enriched with the tensor type.
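To give a flavour of one such transformation, the sketch below illustrates loop tiling (blocking) on a plain matrix multiplication; compilers such as TVM or MLIR generate equivalent loop nests automatically for the target hardware. The tile size and the NumPy formulation are purely illustrative.

    import numpy as np

    def tiled_matmul(A, B, tile=64):
        """Blocked matrix multiplication: each (tile x tile) block of A and B
        stays hot in cache while it contributes to a block of C, improving
        cache reuse compared with a naive triple loop over scalars."""
        n, k = A.shape
        k2, m = B.shape
        assert k == k2
        C = np.zeros((n, m), dtype=A.dtype)
        for i0 in range(0, n, tile):
            for j0 in range(0, m, tile):
                for k0 in range(0, k, tile):
                    C[i0:i0 + tile, j0:j0 + tile] += (
                        A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                    )
        return C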
A critical parameter for optimizing the performance of a DNN model is its batch size. It controls the internal core utilization, the memory consumption, and the data exchange between the CPU holding the input data and the device running the DNN (if different). The best batch size value is unpredictable; therefore it is a common practice to scan a range of potential values, as sketched below.
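A minimal sketch of such a scan, assuming a predict(batch) callable as in the unified API of the next section, could look like this (candidate values and the number of timed runs are arbitrary choices for the example):

    import time

    def scan_batch_sizes(predict, x, candidates=(1, 8, 32, 64, 128, 256), n_runs=10):
        """Measure throughput (predictions per second) for several candidate
        batch sizes and return the fastest one."""
        best_bs, best_throughput = None, 0.0
        for bs in candidates:
            batch = x[:bs]
            predict(batch)                       # warm-up run, not timed
            start = time.perf_counter()
            for _ in range(n_runs):
                predict(batch)
            elapsed = time.perf_counter() - start
            throughput = n_runs * bs / elapsed
            if throughput > best_throughput:
                best_bs, best_throughput = bs, throughput
        return best_bs, best_throughput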
III. INFERENCE FRAMEWORKS UNIFIED API
Different inference frameworks have different APIs, which is why a common Python API is proposed here. The overhead of calling it is negligible. It contains two functions: “load(path, params)” and “predict(x) → y”.
The “load(path, params)” function loads a previously trained model from the disk at the location path and applies post-training optimization with the dictionary of parameters params. The optimized DAG is stored on the targeted GPU given by params[‘gpuid’], with ‘-1’ selecting the CPU.
Since post-optimization may take minutes, inference frameworks generally propose caching the optimized DAG on disk. One application of this cached optimized DAG is the elasticity of the prediction service deployed as a micro-service.
The prediction function is given by f(x) → y, with x the data samples and y the associated predictions. x is split into batches of size params[‘batch_size’]. The batching is done asynchronously in f to overlap the computation of the prediction with the DAG and the movements of x and y.
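As an illustration, a hypothetical use of this unified API could look as follows; the exact return value of load() and the sample data are assumptions made for the example, based on the description above.

    import numpy as np

    # Parameter names follow the description ('gpuid', 'batch_size'); the
    # concrete behaviour of load() depends on the chosen inference framework.
    params = {"gpuid": 0,        # target GPU id; -1 selects the CPU
              "batch_size": 32}  # size of the batches predict() iterates over

    model = load("models/resnet50", params)   # post-training optimization happens here
    x = np.random.rand(1000, 224, 224, 3).astype("float32")  # dummy input samples
    y = model.predict(x)                      # y: predictions aligned with x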
IV. EXPERIMENTAL SETTINGS
Neural networks. ML scientists have developed a variety of convolutional blocks (e.g., VGG blocks, residual connections). They propose different neural networks (e.g., ResNet50, ResNet101, ...), sometimes with the help of automatic tuning techniques, to optimize the trade-off between accuracy and computing efficiency. We present four of them in Table I.
TABLE I: Four deep neural network architectures. We chose them to be diverse in terms of depth, width (ratio between #params and #layers), and density.

Model            #param./#layers   #layers   #jumps   #param.   Jump type
VGG19            7.26M             19        0        138M      N/A
ResNet50         0.52M             50        16       26M       Additions
DenseNet201      0.1M              201       98       20M       Concatenations
EfficientNetB0   0.06M             89        25       5.3M      Mult. and Add.
GPU technologies have evolved in computing devices con-
taining thousands of cores and gigabytes of memory, and
modern deep neural networks may not guarantee the use of all
resources in one single GPU. To match the ML community
computing needs, an ideal inference framework should not
only efficiently leverage its resources to one DNN but should
be performant to run multiple DNNs at the same time. That
is why propose the ensembles in our thesis. All values in
parentheses are our evaluated test top1-accuracy on Ima-
geNet: VGG19 (69.80%), ResNet50 (74.26%), DenseNet201
(75.11%) and ensemble them by interpolating their predictions
performs 76.92%. We choose not to add efficient-netB0 in the
ensemble due to operations errors with some frameworks.
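As a sketch, and assuming each model exposes the predict(x) interface of Section III returning class probabilities, interpolating the predictions of the ensemble simply amounts to averaging the per-class probabilities:

    import numpy as np

    def ensemble_predict(models, x):
        """Average the class probabilities of several already-loaded models."""
        probabilities = [m.predict(x) for m in models]
        return np.mean(probabilities, axis=0)

    # The predicted labels are the argmax of the averaged probabilities, e.g.:
    # labels = np.argmax(ensemble_predict([vgg19, resnet50, densenet201], x), axis=1)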
Our benchmarks are performed on two machines. We do not distribute inference over multiple GPUs in order to keep the analysis focused on the inference framework, even though we would expect performance gains using the map-reduce paradigm.
Machine A is an HGX1 equipped with Tesla V100 SXM2 GPUs, each containing 5120 CUDA cores running at 1312 MHz-1530 MHz and 16 GB of GPU memory. Its CPU is a dual-socket Intel(R) Xeon(R) CPU E5-2698 v4 with a total of 80 cores @ 1.2 GHz-3.6 GHz and 512 GB of RAM. The total GPU board power is 300 watts.
Machine B is an in-house machine equipped with NVIDIA Ampere A100 PCI-E GPUs, each containing 6912 CUDA cores running at 765 MHz-1410 MHz and 40 GB of GPU memory. Its CPU is a single-socket AMD EPYC 7F52 with 16 cores running at 2.5 GHz-3.5 GHz and 256 GB of RAM. The total GPU board power is 250 watts.
Software stack. The assessed inference framework versions are the same on both machines: TensorFlow 2.6, TensorRT 8.0, ONNX Runtime 1.10, and OpenVINO 2021. The neural network converters needed to benchmark those frameworks are tf2onnx 1.9.3 and LLVM 14.0.
Machine A runs Ubuntu with Python 3.9; Machine B runs CentOS with Python 3.8. A specialized real-time OS may improve latency determinism and latency itself but could also lower throughput. We evaluated MLIR (the onnx-mlir 0.2 framework [5]) on Machine B, but it supports only ResNet50.
In all our benchmarks, TensorFlow is accelerated with XLA [4] (Accelerated Linear Algebra).
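The paper does not detail the exact mechanism used to enable XLA; one common way to turn it on in TensorFlow 2.x is sketched below, either globally or per compiled function.

    import tensorflow as tf

    # Enable XLA auto-clustering globally for the TensorFlow runtime.
    tf.config.optimizer.set_jit(True)

    # Alternatively, a specific function can be JIT-compiled with XLA:
    @tf.function(jit_compile=True)
    def predict_fn(model, batch):
        return model(batch, training=False)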
Graph optimization settings: