
is possible to merge a sequence of convolutions [13] and a
sequence of dense operations such as “Y=AX+B”.
Fusing removes the need to store some intermediate tensors in slow memory and improves cache utilization. It also removes the synchronization barriers between operations, maximizing the utilization of the cores. Note that these optimizations may also benefit the training phase, as done in XLA [4].
Moreover, some fusing operations amount to mathematical simplification. For example, adjacent convolution and batch-normalization layers may be merged into a single convolution layer with adjusted weights [14].
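As an illustration, a minimal NumPy sketch of this folding could look as follows, assuming a 2D convolution with weights of shape (out_channels, in_channels, kh, kw) and per-channel batch-normalization statistics; the function and variable names are ours, not taken from [14]:

import numpy as np

def fold_batchnorm_into_conv(W, b, mean, var, gamma, beta, eps=1e-5):
    """Fold a BatchNorm layer y = gamma*(x - mean)/sqrt(var + eps) + beta
    into the preceding convolution x = conv(inp, W) + b.

    W: conv weights of shape (out_channels, in_channels, kh, kw)
    b: conv bias of shape (out_channels,)
    mean, var, gamma, beta: per-output-channel BN statistics/parameters.
    Returns the weights and bias of an equivalent single convolution.
    """
    scale = gamma / np.sqrt(var + eps)          # per-channel multiplier
    W_fused = W * scale[:, None, None, None]    # rescale each output filter
    b_fused = (b - mean) * scale + beta         # shift the bias accordingly
    return W_fused, b_fused

At inference time, the fused layer produces the same output as the convolution followed by batch normalization, but with one memory pass and one kernel launch less.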
Other high-level optimizations include constant folding, which pre-computes the parts of the graph whose inputs are all constants, and static memory planning, which pre-allocates the intermediate tensor buffers.
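For instance (a hypothetical illustration, not tied to a specific framework), a node that multiplies two constant tensors can be replaced at optimization time by its pre-computed result:

import numpy as np

# Hypothetical graph fragment: scale and weight are both compile-time constants,
# so their product can be evaluated once during graph optimization ...
scale = np.float32(1.0 / 255.0)
weight = np.random.rand(64, 3, 3, 3).astype(np.float32)

folded_weight = scale * weight  # constant folding: computed by the compiler

# ... and the runtime graph keeps only the single folded constant instead of
# re-evaluating "scale * weight" at every inference call.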
Low-level optimization. The computing graph is converted into low-level code optimized for the target hardware. These optimizations benefit from decades of compilation expertise: common sub-expression elimination, vectorization, loop ordering/tiling/unrolling, threading patterns, memory caching/reuse, etc. TVM [15] and LLVM MLIR [5] are low-level compilers and optimizers enriched with the tensor type.
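As a simple illustration of loop tiling, a matrix multiplication can be reorganized into blocks that stay resident in cache. The sketch below is plain Python/NumPy for readability; compilers such as TVM emit tiled native code rather than interpreted loops:

import numpy as np

def matmul_tiled(A, B, tile=32):
    """Blocked (tiled) matrix multiplication C = A @ B.
    Each (tile x tile) block of A and B is reused while it is hot in cache,
    which is the effect low-level compilers obtain with loop tiling."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C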
A critical part of optimizing the performance of a DNN model is its batch size. It controls core utilization, memory consumption, and the data exchange between the CPU holding the input data and the device running the DNN (if different). The best batch size is hard to predict, so it is common practice to scan a range of candidate values.
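A minimal sketch of such a scan, assuming any batched inference function predict and an arbitrary list of candidate sizes (both of our choosing), could be:

import time
import numpy as np

def scan_batch_sizes(predict, x, candidates=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Measure throughput (samples/second) for each candidate batch size
    and return the best one together with all measurements."""
    results = {}
    for bs in candidates:
        start = time.perf_counter()
        for i in range(0, len(x), bs):
            predict(x[i:i + bs])
        elapsed = time.perf_counter() - start
        results[bs] = len(x) / elapsed
    return max(results, key=results.get), results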
III. INFERENCE FRAMEWORKS UNIFIED API
Different inference frameworks have different APIs, which is why we propose a common Python API here. Its calling overhead is negligible. It contains two functions: "load(path, params)" and "predict(x)→y".
The "load(path, params)" function loads a previously trained model from disk at the location path and applies post-training optimization driven by the parameter dictionary params. The optimized DAG is stored on the target GPU given by params['gpuid'] ('-1' for the CPU).
Since post-training optimization may take minutes, inference frameworks generally allow caching the optimized DAG on disk. One application of this cached optimized DAG is the elastic deployment of prediction as a micro-service.
The prediction function is given by f(x)→y, with x the data samples and y the associated predictions. x is split into batches of size params['batch size']. Batching is done asynchronously in f to overlap the computation of the predictions by the DAG with the movements of x and y.
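For concreteness, a minimal sketch of this unified API, backed here by ONNX Runtime as one possible backend, could look like the following; the class name, default values, and the synchronous batching loop are simplifications of ours, not a fixed specification:

import numpy as np
import onnxruntime as ort  # one possible backend among those benchmarked

class UnifiedModel:
    """Minimal wrapper exposing load(path, params) and predict(x) -> y."""

    def load(self, path, params):
        gpuid = params.get("gpuid", -1)
        self.batch_size = params.get("batch size", 32)
        providers = (
            [("CUDAExecutionProvider", {"device_id": gpuid})]
            if gpuid >= 0 else ["CPUExecutionProvider"]
        )
        self.session = ort.InferenceSession(path, providers=providers)
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, x):
        # Synchronous batching for readability; as described above, the actual
        # implementation overlaps batch transfers with computation asynchronously.
        outputs = []
        for i in range(0, len(x), self.batch_size):
            batch = x[i:i + self.batch_size].astype(np.float32)
            outputs.append(self.session.run(None, {self.input_name: batch})[0])
        return np.concatenate(outputs)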
IV. EXPERIMENTAL SETTINGS
Neural networks. ML scientists have developed a variety of convolutional blocks (e.g., VGG, residual connections). They propose different neural networks (e.g., ResNet50, ResNet101, ...), sometimes with the help of automatic tuning techniques, to optimize the trade-off between accuracy and computing efficiency. We present four of them in Table I.
TABLE I: Four deep neural network architectures. We chose them to be diverse in terms of depth, width (ratio between #params and #layers), and density.
                 #param./#layers   #layers   #jumps   #param.   Jump type
VGG19                 7.26M           19        0       138M    N/A
ResNet50              0.52M           50       16        26M    Additions
DenseNet201           0.1M           201       98        20M    Concatenations
EfficientNetB0        0.06M           89       25       5.3M    Mult. and Add.
GPUs have evolved into computing devices containing thousands of cores and gigabytes of memory, and a modern deep neural network may not fully utilize all the resources of a single GPU. To match the computing needs of the ML community, an ideal inference framework should not only dedicate its resources efficiently to one DNN but should also perform well when running multiple DNNs at the same time. That is why we propose ensembles in this work. All values in parentheses are our measured top-1 test accuracy on ImageNet: VGG19 (69.80%), ResNet50 (74.26%), DenseNet201 (75.11%); ensembling them by interpolating their predictions reaches 76.92%. We chose not to add EfficientNetB0 to the ensemble due to operator errors with some frameworks.
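As an illustration of the ensembling scheme, assuming the interpolation is a simple uniform average of the per-class probabilities (one possible choice among others):

import numpy as np

def ensemble_predict(models, x):
    """Interpolate the predictions of several models by averaging their
    per-class probabilities, then pick the most likely class."""
    probs = [m.predict(x) for m in models]   # each: (n_samples, n_classes)
    mean_probs = np.mean(probs, axis=0)      # uniform interpolation
    return np.argmax(mean_probs, axis=1)     # top-1 predicted class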
Our benchmarks are performed on two machines. We do not distribute inference over multiple GPUs, to keep the analysis focused on the inference framework, even though we would expect performance gains from the map-reduce paradigm.
Machine A is an HGX-1 equipped with Tesla V100 SXM2 GPUs, each containing 5120 CUDA cores running at 1312-1530 MHz with 16 GB of GPU memory. Its CPU is a dual-socket Intel(R) Xeon(R) CPU E5-2698 v4 with a total of 80 cores at 1.2-3.6 GHz, and 512 GB of RAM. The total GPU board power is 300 watts.
Machine B is an in-house machine equipped with NVIDIA Ampere A100 PCI-E GPUs, each containing 6912 CUDA cores running at 765-1410 MHz with 40 GB of GPU memory. Its CPU is a single-socket AMD EPYC 7F52 with 16 cores running at 2.5-3.5 GHz, and 256 GB of RAM. The total GPU board power is 250 watts.
Software stack. All software versions are the same on both machines. The assessed inference framework versions are TensorFlow 2.6, TensorRT 8.0, ONNX Runtime 1.10, and OpenVINO 2021. The neural network converters needed to benchmark these frameworks are tf2onnx 1.9.3 and LLVM 14.0.
Machine A runs Ubuntu with Python 3.9; Machine B runs CentOS with Python 3.8. A specialized real-time OS may improve latency determinism and latency itself, but could also lower throughput. We evaluated MLIR (the onnx-mlir 0.2 framework [5]) on Machine B, but it supports only ResNet50.
In all our benchmarks, TensorFlow is accelerated with XLA [4] (Accelerated Linear Algebra).
Graph optimization settings: