
outputs satisfy a particular criterion. The working mechanism of early-termination AdNNs can be formulated as,
\[
\begin{cases}
\mathit{ExitNN}(x) = \mathit{Exit}_i(x), & \text{if } B_i(x) \ge \tau_i \\
\mathit{In}_{i+1}(x) = \mathit{Out}_i(x), & \text{otherwise}
\end{cases}
\tag{2}
\]
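For concreteness, the snippet below is a minimal sketch of the early-termination mechanism in Eq. (2), written against PyTorch. The EarlyExitNet class, its block and exit modules, and the use of softmax confidence as the criterion B_i are illustrative assumptions, not the exact architectures evaluated in this paper.

```python
# Minimal early-termination sketch (illustrative; assumes batch size 1 and
# shape-compatible user-supplied blocks/exit heads).
import torch
import torch.nn as nn


class EarlyExitNet(nn.Module):
    def __init__(self, blocks, exits, thresholds):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)   # backbone blocks producing Out_i(x)
        self.exits = nn.ModuleList(exits)     # exit classifiers Exit_i
        self.thresholds = thresholds          # per-exit thresholds tau_i

    def forward(self, x):
        for block, exit_head, tau in zip(self.blocks, self.exits, self.thresholds):
            x = block(x)                      # In_{i+1}(x) = Out_i(x)
            logits = exit_head(x)             # candidate prediction Exit_i(x)
            # B_i(x): here, the maximum softmax confidence of the exit head
            confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
            if confidence.item() >= tau:      # B_i(x) >= tau_i: terminate early
                return logits
        return logits                         # otherwise fall through to the last exit
```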
2.2 Redundant Computation
In a software program, if an operation is performed but not required, we term it a redundant operation. For Adaptive Neural Networks, if a component is activated without affecting the AdNN’s final prediction, we define the computation as redundant computation. AdNNs are built on the philosophy that not all inputs should require all DNN components for inference. For example, consider the images in Fig. 2. The left box shows the AdNNs’ design philosophy: AdNNs consume more energy to process images with higher semantic complexity. However, when the third image in the left box is perturbed with minimal perturbations and becomes the rightmost one, the AdNN’s inference energy consumption increases significantly (from 30 J to 68 J). We refer to such additional computation as redundant computation or performance degradation.
2.3 Performance & Computational Complexity
In this section, we describe the relationship between hardware-dependent performance metrics and DNN computational complexity. Although many metrics can reflect DNN performance, we chose latency and energy consumption as hardware-dependent performance metrics because of their critical nature for real-time embedded systems [3, 49]. Measuring hardware-dependent performance metrics (e.g., latency, energy consumption) usually requires many repeated experiments, which is costly. Hence, existing work [12, 14, 29, 35, 41, 52] proposes to apply floating-point operations (FLOPs) to represent DNN computational complexity.
However, a recent study [43] demonstrates that simply lowering DNN computational complexity (FLOPs) does not always improve DNN runtime performance. This is because modern hardware platforms usually apply parallelism to handle DNN floating-point operations. Parallelism can accelerate computation within a layer, while the DNN layers themselves are computed sequentially. Thus, for two DNNs with the same total FLOPs, different FLOP allocation strategies will result in different parallelism utilization and different DNN model performance. For AdNNs, however, each layer/block usually has a similar structure and FLOP count [12, 14, 34, 52], so the parallelism utilization is similar for each block. Because parallelism cannot accelerate computation across blocks, increasing the number of activated computational blocks/layers will degrade AdNNs’ performance. To further understand the relation between AdNNs’ FLOPs and AdNNs’ model performance, we conduct a study in §3.
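To make this argument concrete, the back-of-the-envelope sketch below uses a simplified analytic latency model (with hypothetical throughput and per-layer overhead numbers, not a real profiler) to show how two networks with identical total FLOPs but different per-layer allocations can differ in estimated latency.

```python
# Illustrative only: intra-layer work is parallelized, but layers run sequentially,
# so more (smaller) layers can mean higher latency at the same total FLOPs.
PEAK_FLOPS_PER_SEC = 1e12    # assumed parallel throughput within a layer
LAYER_OVERHEAD_SEC = 5e-5    # assumed fixed sequential cost per layer


def estimated_latency(per_layer_flops):
    """Sum of (parallel compute time + fixed overhead) over sequential layers."""
    return sum(f / PEAK_FLOPS_PER_SEC + LAYER_OVERHEAD_SEC for f in per_layer_flops)


total_flops = 1e9
shallow = [total_flops / 10] * 10     # 10 wide layers
deep = [total_flops / 100] * 100      # 100 narrow layers, same total FLOPs

print(estimated_latency(shallow))     # ~1.5e-3 s
print(estimated_latency(deep))        # ~6.0e-3 s: more sequential steps, slower
```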
3 PRELIMINARY STUDY
3.1 Study Approach
Our intuition is to explore the worst-case computational complexity of an algorithm or model. For AdNNs, the basic computations are floating-point operations (FLOPs). Thus, we make the assumption that the FLOP count of an AdNN is a hardware-independent metric that approximates AdNN performance. To validate this assumption, we conduct an empirical study.
Figure 2: The left box shows that AdNNs allocate different computational resources to images with different semantic complexity; the right box shows that a perturbed image can trigger redundant computation and cause an energy surge.
Specifically, we compute the Pearson product-moment correlation coefficients (PCCs) [40] between AdNN FLOPs and AdNN latency and energy consumption. PCCs are widely used in statistical methods to measure the linear correlation between two variables. PCCs are normalized covariance measurements ranging from -1 to 1; a higher PCC indicates that the two variables are more positively correlated. If the PCCs between FLOPs and system latency and between FLOPs and system energy consumption are both high, then our assumption is validated.
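The correlation analysis itself is straightforward; the sketch below assumes that per-input FLOP counts and measured latency/energy have already been collected, and the arrays are placeholders rather than the measurements reported in this study.

```python
# PCC between FLOPs and hardware-dependent metrics (placeholder data).
from scipy.stats import pearsonr

flops   = [1.2e8, 2.5e8, 3.1e8, 4.0e8, 5.6e8]    # FLOPs consumed per test input
latency = [0.011, 0.021, 0.026, 0.034, 0.047]    # measured latency (s), same inputs
energy  = [12.0, 24.5, 30.1, 39.8, 55.2]         # measured energy (J), same inputs

pcc_latency, _ = pearsonr(flops, latency)
pcc_energy, _ = pearsonr(flops, energy)
print(f"PCC(FLOPs, latency) = {pcc_latency:.3f}")
print(f"PCC(FLOPs, energy)  = {pcc_energy:.3f}")
```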
3.2 Study Model & Dataset
We select subjects (e.g., model,dataset) following policies below.
•The selected subjects are publicly available.
•The selected subjects are widely used in existing work.
•
The selected dataset and models should be diverse from dierent
perspectives. e.g.,, the selected models should include both early-
termination and conditional-skipping AdNNs.
We select ve popular model-dataset combinations used for image
classication tasks as our experimental subjects. The dataset and the
corresponding model are listed in Table 1. We explain the selected
datasets and corresponding models below.
Datasets. CIFAR-10 [25] is a database for object recognition. There are ten object classes in this dataset, and each image in CIFAR-10 is 32×32 pixels. CIFAR-10 contains 50,000 training images and 10,000 testing images. CIFAR-100 [25] is similar to CIFAR-10 [25] but with 100 classes. It also contains 50,000 training images and 10,000 testing images. SVHN [36] is a real-world image dataset obtained from house numbers in Google Street View images. There are 73,257 training images and 26,032 testing images in SVHN.
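For reference, all three test sets are available through standard libraries such as torchvision; the snippet below is an illustrative loading sketch (the root path and transform are assumptions, and this study does not prescribe a particular data pipeline).

```python
# Loading the hold-out test splits used as subjects (illustrative).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

cifar10_test = datasets.CIFAR10(root="data", train=False, download=True, transform=to_tensor)
cifar100_test = datasets.CIFAR100(root="data", train=False, download=True, transform=to_tensor)
svhn_test = datasets.SVHN(root="data", split="test", download=True, transform=to_tensor)

print(len(cifar10_test), len(cifar100_test), len(svhn_test))  # 10000 10000 26032
```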
Models. For the CIFAR-10 dataset, we use the SkipNet [52] and BlockDrop [53] models. SkipNet applies reinforcement learning to train DNNs to skip unnecessary blocks, and BlockDrop trains a policy network to activate only a subset of blocks to save computation costs. We download the trained SkipNet and BlockDrop models from the authors’ websites. For the CIFAR-100 dataset, we use the RaNet [56] and DeepShallow [24] models for evaluation. DeepShallow adaptively scales DNN depth, while RaNet scales both input resolution and DNN depth to balance accuracy and performance. For the SVHN dataset, DeepShallow [24] is used for evaluation. For the RaNet [56] and DeepShallow [24] architectures, the authors do not release the trained model weights but open-source their training code. Therefore, we follow the authors’ instructions to train the model weights.
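To complement the early-termination sketch above, the snippet below illustrates the conditional-skipping idea behind SkipNet and BlockDrop with a simplified gated residual block; the gate design and the 0.5 decision threshold are illustrative assumptions, not the released models.

```python
# Simplified conditional-skipping block (illustrative; one gate decision per batch,
# whereas SkipNet/BlockDrop decide per input).
import torch.nn as nn


class GatedResidualBlock(nn.Module):
    def __init__(self, block, gate):
        super().__init__()
        self.block = block   # the costly residual branch F(x)
        self.gate = gate     # cheap gate module producing values in [0, 1]

    def forward(self, x):
        g = self.gate(x)                  # gating decision for this input
        if g.mean().item() < 0.5:         # gate closed: skip the residual branch
            return x                      # identity shortcut only
        return x + self.block(x)          # gate open: full residual computation
```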
3.3 Study Process
We begin by evaluating each model’s computational complexity on
the original hold-out test dataset. After that, we deploy the AdNN