
1. We propose LASNet, which performs coarse-grained spatially adaptive inference guided by the practical latency instead of the theoretical FLOPs. To the best of our knowledge, LASNet is the first framework that directly considers the real latency in the design phase of dynamic neural networks;
2. We propose a latency prediction model, which can efficiently and accurately estimate the latency of dynamic operators by simultaneously considering the algorithm, the scheduling strategy, and the hardware properties;
3. Experiments on image classification and downstream tasks verify that our proposed LASNet can effectively improve the practical efficiency of different CNN architectures.
2 Related works
Spatial-wise dynamic network is a common type of dynamic neural networks [7]. Compared to static models, which treat different feature locations evenly during inference, these networks perform spatially adaptive inference on the most informative regions (e.g., foregrounds) and reduce the unnecessary computation on less important areas (e.g., backgrounds). Existing works mainly include three levels of dynamic computation: resolution level [36, 37], region level [33] and pixel level [5, 31, 35]. The former two generally manipulate the network inputs [33, 37] or require special architecture design [36]. In contrast, pixel-level dynamic networks can flexibly skip the convolutions on certain feature pixels in arbitrary CNN backbones [5, 31, 35]. Despite its remarkable theoretical efficiency, pixel-wise dynamic computation brings considerable difficulty to achieving realistic speedup on multi-core processors, e.g., GPUs. Compared to the previous approaches [5, 31, 35], which only focus on reducing the theoretical computation, we propose to directly use the latency to guide our algorithm design and scheduling optimization.
Hardware-aware network design. To bridge the gap between theoretical and practical efficiency of deep models, researchers have started to consider the real latency in the network design phase. There are two lines of works in this direction. One directly performs speed tests on targeted devices, and summarizes guidelines to facilitate hand-designing lightweight models [23]. The other line of work searches for fast models using the neural architecture search (NAS) technique [29, 34]. However, all existing works try to build static models, which have intrinsic computational redundancy by treating different inputs in the same way. Moreover, speed tests for dynamic operators on different hardware devices can be very laborious and impractical. In contrast, our proposed latency prediction model can efficiently estimate the inference latency on any given computing platform by simultaneously considering algorithm design, scheduling strategies and hardware properties.
3 Methodology
In this section, we first introduce the preliminaries of spatially adaptive inference, and then demonstrate the architecture design of our LASNet. The latency prediction model is then explained, which guides the granularity settings and the scheduling optimization for LASNet. We further present the implementation improvements for faster inference, followed by the training strategies.
3.1 Preliminaries
Spatially adaptive inference. The existing spatial-wise dynamic networks are usually established by attaching a masker $\mathcal{M}$ to each convolutional block of a CNN backbone (Figure 1 (a)). Specifically, let $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ denote the input of a block, where $H$ and $W$ are the feature height and width, and $C$ is the channel number. The masker $\mathcal{M}$ takes $\mathbf{x}$ as input, and generates a binary-valued spatial mask $\mathbf{M} = \mathcal{M}(\mathbf{x}) \in \{0, 1\}^{H \times W}$. Each element of $\mathbf{M}$ determines whether to perform convolution operations at the corresponding location of the output feature. The unselected regions will be filled with the values from the input [5, 31] or obtained via interpolation [35]. We define the activation rate of a block as $r = \frac{\sum_{i,j} M_{i,j}}{H \times W}$, representing the ratio of the calculated pixels.
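The masker and the activation rate can be sketched in a few lines. The following is a minimal NumPy illustration: the thresholded channel-mean masker here is a hypothetical stand-in for the small learned sub-network used in practice, not the paper's actual masker.

```python
import numpy as np

def masker(x, threshold=0.0):
    """Toy masker M: select spatial locations whose channel-mean
    activation exceeds a threshold (illustrative stand-in for a
    learned masker). x: (H, W, C) -> binary mask of shape (H, W)."""
    return (x.mean(axis=-1) > threshold).astype(np.int64)

def activation_rate(mask):
    """r = sum_{i,j} M_ij / (H * W): the ratio of computed pixels."""
    H, W = mask.shape
    return mask.sum() / (H * W)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))   # one block input, H = W = 8, C = 16
M = masker(x)                          # binary (8, 8) mask
r = activation_rate(M)                 # scalar in [0, 1]
```

A lower $r$ means fewer pixels are convolved; the open question the paper addresses is whether that theoretical saving translates into real latency reduction.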
Scheduling strategy. During inference, the current scheduling strategy for spatial-wise dynamic convolutions generally involves three steps [26] (Figure 1 (b)): 1) gathering, which re-organizes the selected pixels (if the convolution kernel size is greater than $1 \times 1$, the neighbors are also required) along the batch dimension; 2) computation, which performs convolution on the gathered input; and 3) scattering, which fills the computed pixels into their corresponding locations of the output feature.
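The three steps above can be sketched for the simplest case of a $1 \times 1$ kernel, where no neighboring pixels need to be gathered. This NumPy sketch fills unselected pixels with the input values, as in [5, 31], and assumes equal input and output channel counts for that purpose; it is an illustration of the schedule, not the paper's implementation.

```python
import numpy as np

def gather_compute_scatter(x, mask, weight):
    """Gather-compute-scatter schedule for a 1x1 dynamic convolution.
    x: (H, W, C) input, mask: (H, W) binary, weight: (C, C)."""
    # 1) gathering: pack the selected pixels into a dense (N, C) batch
    idx = np.nonzero(mask)
    gathered = x[idx]                 # (N, C)
    # 2) computation: a 1x1 convolution is a per-pixel matrix multiply
    computed = gathered @ weight      # (N, C)
    # 3) scattering: write results back; unselected pixels keep the
    #    input values (possible here because C_in == C_out)
    out = x.copy()
    out[idx] = computed
    return out
```

The gather and scatter steps are exactly the memory-bound operations whose cost FLOP counting ignores, which is why their latency must be modeled explicitly on multi-core hardware.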