Latency-aware Spatial-wise Dynamic Networks
Yizeng Han1, Zhihang Yuan2, Yifan Pu1, Chenhao Xue2, Shiji Song1, Guangyu Sun2, Gao Huang1†

1Department of Automation, BNRist, Tsinghua University, Beijing, China
2School of Electronics Engineering and Computer Science, Peking University, Beijing, China
{hanyz18, pyf20}@mails.tsinghua.edu.cn,{shijis, gaohuang}@tsinghua.edu.cn
{yuanzhihang, xch927027, gsun}@pku.edu.cn
Abstract
Spatial-wise dynamic convolution has become a promising approach to improving
the inference efficiency of deep networks. By allocating more computation to the
most informative pixels, such an adaptive inference paradigm reduces the spatial
redundancy in image features and saves a considerable amount of unnecessary
computation. However, the theoretical efficiency achieved by previous methods hardly translates into realistic speedup, especially on multi-core processors (e.g. GPUs). The key challenge is that the existing literature has focused only on designing algorithms with minimal computation, ignoring the fact that practical latency is also influenced by scheduling strategies and hardware properties. To bridge the gap between theoretical computation and practical efficiency, we propose a latency-aware spatial-wise dynamic network (LASNet),
which performs coarse-grained spatially adaptive inference under the guidance
of a novel latency prediction model. The latency prediction model can efficiently
estimate the inference latency of dynamic networks by simultaneously considering
algorithms, scheduling strategies, and hardware properties. We use the latency
predictor to guide both the algorithm design and the scheduling optimization on
various hardware platforms. Experiments on image classification, object detection
and instance segmentation demonstrate that the proposed framework significantly
improves the practical inference efficiency of deep networks. For example, the
average latency of a ResNet-101 on the ImageNet validation set could be reduced
by 36% and 46% on a server GPU (Nvidia Tesla-V100) and an edge device (Nvidia
Jetson TX2 GPU) respectively without sacrificing the accuracy. Code is available
at https://github.com/LeapLabTHU/LASNet.
1 Introduction
Dynamic neural networks [7] have attracted great research interest in recent years. Compared to static models [11, 17, 13, 23] which treat different inputs equally during inference, dynamic networks can allocate computation in a data-dependent manner. For example, they can conditionally skip the computation of network layers [15, 9, 32, 30] or convolutional channels [19, 1], or perform spatially adaptive inference on the most informative image regions (e.g. the foreground areas) [6, 5, 31, 35, 33, 8]. Spatial-wise dynamic networks, which typically decide whether to compute each feature pixel with masker modules [5, 31, 35, 8] (Figure 1 (a)), have shown promising results in improving the inference efficiency of convolutional neural networks (CNNs).
Despite the remarkable theoretical efficiency achieved by spatial-wise dynamic networks [5, 31, 35], researchers have found it challenging to translate the theoretical results into realistic speedup,
*Equal contribution. †Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06223v1 [cs.CV] 12 Oct 2022
[Figure 1 diagram: (a) Algorithm: spatially adaptive inference with a masker, which selects pixel locations for the dynamic convolution(s); (b) Scheduling: the practical inference pipeline of dynamic convolutions (gather → convolution(s) → scatter); (c) The general idea of our work: algorithm, scheduling, and hardware jointly influence latency, which in turn guides the design.]
Figure 1: An overview of our method. (a) illustrates the spatially adaptive inference algorithm; (b) is
the scheduling strategy; and (c) presents the three key factors to the practical latency. For a given
hardware, the latency is used to guide our algorithm design and scheduling optimization.
especially on some multi-core processors, e.g., GPUs [35, 3, 8]. The challenges are two-fold: 1) most previous approaches [5, 31, 35] perform spatially adaptive inference at the finest granularity: every pixel is flexibly decided whether to be computed or not. Such flexibility induces non-contiguous memory access [35] and requires specialized scheduling strategies (Figure 1 (b)); 2) the existing literature has only adopted the hardware-agnostic FLOPs (floating-point operations) as an inaccurate proxy for efficiency, lacking latency-aware guidance on the algorithm design. For dynamic networks, the adaptive computation with sub-optimal scheduling strategies further enlarges the discrepancy between theoretical FLOPs and practical latency. Note that previous works have validated that the latency on CPUs correlates strongly with FLOPs [8, 35]. Therefore, we mainly focus on the GPU platform in this paper, which is more challenging and less explored.
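To make the FLOPs-latency discrepancy concrete, here is a minimal sketch (our own illustration, not code from the paper) of how theoretical FLOPs are counted for a convolution and how spatially adaptive inference scales them by the activation rate r; the point above is precisely that measured latency does not shrink by the same factor.

```python
# Illustrative FLOPs accounting for a dense KxK convolution (stride 1,
# "same" padding) versus its spatially adaptive counterpart, where only
# a fraction r of output pixels is computed.

def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a dense KxK convolution."""
    return h * w * c_in * c_out * k * k

def dynamic_conv_flops(h, w, c_in, c_out, k, r):
    """Theoretical cost: only the selected fraction r of pixels is computed."""
    return conv_flops(h, w, c_in, c_out, k) * r

static_cost = conv_flops(56, 56, 64, 64, 3)              # 115,605,504 MACs
dynamic_cost = dynamic_conv_flops(56, 56, 64, 64, 3, 0.5)
print(static_cost, dynamic_cost)  # dynamic is exactly half of static
```

The theoretical cost scales linearly with r; on a GPU the gather/scatter overhead and non-contiguous memory access prevent the latency from following suit.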
We address the above challenges by proposing a latency-aware spatial-wise dynamic network
(LASNet). Three key factors to the inference latency are considered: the algorithm, the scheduling
strategy, and the hardware properties. Given a target hardware device, we directly use the latency,
rather than the FLOPs, to guide our algorithm design and scheduling optimization (Figure 1 (c)).
Because the memory access pattern and the scheduling strategies in our dynamic operators differ
from those in static networks, the libraries developed for static models (e.g. cuDNN) are sub-optimal
for the acceleration of dynamic models. Without the support of libraries, each dynamic operator
requires scheduling optimization, code optimization, compiling, and deployment for each device.
Therefore, it is laborious to evaluate the network latency on different hardware platforms. To this end,
we propose a novel latency prediction model to efficiently estimate the realistic latency of a network
by simultaneously considering the aforementioned three factors. Compared to the hardware-agnostic
FLOPs, our predicted latency can better reflect the practical efficiency of dynamic models.
Guided by this latency prediction model, we establish our latency-aware spatial-wise dynamic network (LASNet), which adaptively decides whether to allocate computation on feature patches instead of pixels [5, 31, 35] (Figure 2 top). We name this paradigm spatially adaptive inference at a coarse granularity. While less flexible than the pixel-level adaptive computation in previous works [5, 31, 35], it facilitates more contiguous memory access, benefiting realistic speedup on hardware.
The scheduling strategy and the implementation are further ameliorated for faster inference.
It is worth noting that LASNet is designed as a general framework in two aspects: 1) the coarse-grained spatially adaptive inference paradigm can be conveniently implemented in various CNN backbones, e.g., ResNets [11], DenseNets [17] and RegNets [25]; and 2) the latency predictor is an off-the-shelf tool which can be directly applied to various computing platforms (e.g. server GPUs and edge devices).
We evaluate the performance of LASNet with multiple CNN architectures on image classification, object detection, and instance segmentation tasks. Experimental results show that LASNet improves the efficiency of deep CNNs both theoretically and practically. For example, the inference latency of ResNet-101 on ImageNet [4] is reduced by 36% and 46% on an Nvidia Tesla V100 GPU and an Nvidia Jetson TX2 GPU, respectively, without sacrificing accuracy. Moreover, the proposed method outperforms various lightweight networks in the low-FLOPs regime.
Our main contributions are summarized as follows:
1. We propose LASNet, which performs coarse-grained spatially adaptive inference guided by the practical latency instead of the theoretical FLOPs. To the best of our knowledge, LASNet is the first framework that directly considers the real latency in the design phase of dynamic neural networks;
2. We propose a latency prediction model, which can efficiently and accurately estimate the latency of dynamic operators by simultaneously considering the algorithm, the scheduling strategy, and the hardware properties;
3. Experiments on image classification and downstream tasks verify that our proposed LASNet can effectively improve the practical efficiency of different CNN architectures.
2 Related works
Spatial-wise dynamic network is a common type of dynamic neural network [7]. Compared to static models, which treat different feature locations evenly during inference, these networks perform spatially adaptive inference on the most informative regions (e.g., foregrounds), and reduce the unnecessary computation on less important areas (e.g., backgrounds). Existing works mainly include three levels of dynamic computation: resolution level [36, 37], region level [33] and pixel level [5, 31, 35]. The former two generally manipulate the network inputs [33, 37] or require special architecture design [36]. In contrast, pixel-level dynamic networks can flexibly skip the convolutions on certain feature pixels in arbitrary CNN backbones [5, 31, 35]. Despite its remarkable theoretical efficiency, pixel-wise dynamic computation makes it considerably difficult to achieve realistic speedup on multi-core processors, e.g., GPUs. Compared to the previous approaches [5, 31, 35], which only focus on reducing the theoretical computation, we propose to directly use the latency to guide our algorithm design and scheduling optimization.
Hardware-aware network design. To bridge the gap between the theoretical and practical efficiency of deep models, researchers have started to consider the real latency in the network design phase. There are two lines of work in this direction. One directly performs speed tests on target devices, and summarizes guidelines to facilitate hand-designing lightweight models [23]. The other searches for fast models using neural architecture search (NAS) techniques [29, 34]. However, all existing works build static models, which have intrinsic computational redundancy as they treat different inputs in the same way. Moreover, speed tests for dynamic operators on different hardware devices can be laborious and impractical. In contrast, our proposed latency prediction model can efficiently estimate the inference latency on any given computing platform by simultaneously considering algorithm design, scheduling strategies and hardware properties.
3 Methodology
In this section, we first introduce the preliminaries of spatially adaptive inference, and then demonstrate the architecture design of our LASNet. The latency prediction model is then explained, which
guides the granularity settings and the scheduling optimization for LASNet. We further present the
implementation improvements for faster inference, followed by the training strategies.
3.1 Preliminaries
Spatially adaptive inference. Existing spatial-wise dynamic networks are usually established by attaching a masker ℳ to each convolutional block of a CNN backbone (Figure 1 (a)). Specifically, let x ∈ R^{H×W×C} denote the input of a block, where H and W are the feature height and width, and C is the number of channels. The masker takes x as input and generates a binary-valued spatial mask M = ℳ(x) ∈ {0,1}^{H×W}. Each element of M determines whether to perform convolution operations at the corresponding location of the output feature. The unselected regions are filled with the values from the input [5, 31] or obtained via interpolation [35]. We define the activation rate of a block as r = (Σ_{i,j} M_{i,j}) / (H×W), representing the ratio of the calculated pixels.
Scheduling strategy. During inference, the current scheduling strategy for spatial-wise dynamic convolutions generally involves three steps [26] (Figure 1 (b)): 1) gathering, which re-organizes the selected pixels (if the convolution kernel size is greater than 1×1, their neighbors are also required) along the batch dimension; 2) computation, which performs convolution on the gathered input; and 3) scattering, which fills the computed pixels into their corresponding locations of the output feature.
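The three-step pipeline can be sketched as follows, assuming NumPy and a 1×1 kernel (so no neighboring pixels need gathering); this is an illustrative sketch of ours, not the paper's GPU implementation. The channel count is kept unchanged so that unselected pixels can be filled with the input values, as described above.

```python
# Gather -> compute -> scatter scheduling for a dynamic 1x1 convolution.
import numpy as np

def dynamic_conv1x1(x, mask, weight):
    """x: (H, W, C), mask: (H, W) binary, weight: (C, C)."""
    out = x.copy()                    # unselected pixels keep the input values
    ys, xs = np.nonzero(mask)         # 1) gathering: indices of selected pixels
    gathered = x[ys, xs, :]           # packed along the batch dim: (N_sel, C)
    computed = gathered @ weight      # 2) computation: 1x1 conv is a matmul
    out[ys, xs, :] = computed         # 3) scattering: write results back
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
w = rng.standard_normal((8, 8))
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True                     # compute only one pixel
y = dynamic_conv1x1(x, mask, w)
```

The gather and scatter steps are exactly where the non-contiguous memory access arises, which motivates the overhead discussion below.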
[Figure 2 diagram: top, the LASNet block with a masker (pooling + Conv1x1; Gumbel noise is added during training), upsampling, and element-wise multiplication to select pixels for the Conv1x1–Conv3x3–Conv1x1 bottleneck; bottom, the optimized inference pipeline with fused gather-Conv3x3 and scatter-add operators in place of the regular convolutions.]
Figure 2: Our proposed LASNet block. Top: we first generate a low-resolution spatial mask M_coarse, which is then upsampled to obtain the mask M with the same size as the output feature. Gumbel Softmax [18, 24] is used for end-to-end training (Sec. 3.5). Bottom: the scheduling optimization is performed to decrease the memory access for faster inference (Sec. 3.4).
Limitations.
Compared to performing convolutions on the entire feature map, the aforementioned
scheduling strategy reduces the computation while bringing considerable overhead to memory access
due to the mask generation and the non-contiguous memory access. Such overhead would increase
the overall latency, especially when the granularity of dynamic convolution is at the finest pixel level.
3.2 Architecture design
Spatial granularity. As mentioned above, pixel-level dynamic convolutions [5, 31, 35] raise substantial challenges to achieving realistic speedup on multi-core processors due to non-contiguous memory access. To this end, we propose to optimize the granularity of spatially adaptive inference. Specifically, taking the commonly used bottleneck structure in [11] as an example, our coarse-grained spatial-wise dynamic convolutional block is illustrated in Figure 2. Instead of directly producing a mask with the shape of H×W, we first generate a low-resolution mask M_coarse ∈ {0,1}^{(H/S)×(W/S)}, where S is named the spatial granularity. Each element in M_coarse determines the computation of an S×S-sized feature patch. For instance, the feature size in the first ResNet stage³ is 56×56, so the possible choices for S are {1, 2, 4, 7, 8, 14, 28, 56}. The mask M_coarse is then upsampled to the size of H×W. Notably, S = 1 means that the granularity is still at the pixel level as in previous methods [5, 31, 35]. The other extreme (S = 56), where the masker directly determines whether to skip the whole block (i.e. layer skipping [30, 32]), is not considered in this paper: such an overly aggressive approach leads to a considerable drop in accuracy, as presented in Appendix C.2. The masker is composed of a pooling layer followed by a 1×1 convolution.
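The coarse mask generation and upsampling can be illustrated as follows (a NumPy sketch of ours; in LASNet the coarse mask comes from a learned pooling + 1×1 convolution masker): each entry of M_coarse is replicated over an S×S patch by nearest-neighbor upsampling, which is why S must divide the feature size.

```python
# Nearest-neighbor upsampling of a coarse (H/S, W/S) mask to (H, W),
# so that each coarse entry controls one S x S feature patch.
import numpy as np

def upsample_mask(m_coarse, s):
    """Replicate every entry of m_coarse over an s x s patch."""
    return np.repeat(np.repeat(m_coarse, s, axis=0), s, axis=1)

S = 4
m_coarse = np.array([[1, 0],
                     [0, 1]])        # one decision per 4x4 patch
M = upsample_mask(m_coarse, S)       # shape (8, 8)
print(M.shape, M.sum())              # half of the 64 pixels are selected
```

Larger S makes the selected regions contiguous blocks in memory, which is exactly what enables the realistic speedup discussed above.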
Differences to existing works. Without using the interpolation operation [35] or a carefully designed two-branch structure [8], the proposed block architecture is simple and sufficiently general to be plugged into most backbones with minimal modification. Our formulation is most similar to that in [31], which can be viewed as a variant of our method with the spatial granularity S = 1 for all blocks. Instead of performing spatially adaptive inference at the finest pixel level, our granularity S is optimized under the guidance of our latency prediction model (details in Sec. 4.2) to achieve realistic speedup on target computing platforms.
3.3 Latency prediction model
As stated before, it is laborious to evaluate the latency of dynamic operators on different hardware
platforms. To efficiently seek preferable granularity settings on arbitrary hardware devices, we
propose a latency prediction model
G
, which can directly predict the delay of executing dynamic
operators on any target devices. For a spatial-wise dynamic convolutional block, the latency predictor
G
takes the hardware properties
H
, the layer parameters
P
, the spatial granularity
S
, and the activation
rate ras input and predicts the latency `of a dynamic convolutional block: `=G(H,P, S, r).
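The predictor's interface can be sketched as below. The cost model inside is a deliberately crude roofline-style stand-in of ours (the maximum of compute time and memory time, with a gather/scatter overhead term that shrinks as the granularity S grows); the paper's actual model G explicitly models processing engines and the memory hierarchy, as described next.

```python
# Hypothetical sketch of the interface l = G(H, P, S, r). All names and
# the cost model are illustrative assumptions, not the paper's predictor.
from dataclasses import dataclass

@dataclass
class Hardware:
    peak_flops: float      # FLOPs per second
    mem_bandwidth: float   # bytes per second

@dataclass
class Layer:
    h: int
    w: int
    c_in: int
    c_out: int
    k: int
    bytes_per_elem: int = 4

def predict_latency(hw, layer, s, r):
    """Rough estimate: max of compute time and memory time; the memory
    overhead factor crudely shrinks as granularity s grows (more
    contiguous access)."""
    flops = layer.h * layer.w * layer.c_in * layer.c_out * layer.k ** 2 * 2 * r
    traffic = (layer.h * layer.w * (layer.c_in + layer.c_out)
               * layer.bytes_per_elem) * (1.0 + 1.0 / s)
    return max(flops / hw.peak_flops, traffic / hw.mem_bandwidth)

hw = Hardware(peak_flops=14e12, mem_bandwidth=900e9)   # V100-like numbers
layer = Layer(h=56, w=56, c_in=64, c_out=64, k=3)
l_fine = predict_latency(hw, layer, s=1, r=0.5)
l_coarse = predict_latency(hw, layer, s=8, r=0.5)
```

Even this toy model captures the qualitative behavior that motivates the design: at a fixed activation rate, a coarser granularity is never predicted slower, because only the memory term depends on S.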
Hardware modeling.
We model a hardware device as multiple processing engines (PEs), and parallel
computation can be executed on these PEs. As shown in Figure 3, we model the memory system as a
³Here we refer to a stage as the cascading of multiple blocks which process features with the same resolution.