Latency-aware Spatial-wise Dynamic Networks
Yizeng Han1, Zhihang Yuan2, Yifan Pu1, Chenhao Xue2, Shiji Song1, Guangyu Sun2, Gao Huang1†

1Department of Automation, BNRist, Tsinghua University, Beijing, China
2School of Electronics Engineering and Computer Science, Peking University, Beijing, China
{hanyz18, pyf20}@mails.tsinghua.edu.cn,{shijis, gaohuang}@tsinghua.edu.cn
{yuanzhihang, xch927027, gsun}@pku.edu.cn
Abstract
Spatial-wise dynamic convolution has become a promising approach to improving
the inference efficiency of deep networks. By allocating more computation to the
most informative pixels, such an adaptive inference paradigm reduces the spatial
redundancy in image features and saves a considerable amount of unnecessary
computation. However, the theoretical efficiency achieved by previous methods hardly translates into realistic speedup, especially on multi-core processors (e.g. GPUs). The key challenge is that the existing literature has focused only on designing algorithms with minimal computation, ignoring the fact that practical latency is also influenced by scheduling strategies and hardware properties. To bridge the gap between theoretical computation and practical efficiency, we propose a latency-aware spatial-wise dynamic network (LASNet),
which performs coarse-grained spatially adaptive inference under the guidance
of a novel latency prediction model. The latency prediction model can efficiently
estimate the inference latency of dynamic networks by simultaneously considering
algorithms, scheduling strategies, and hardware properties. We use the latency
predictor to guide both the algorithm design and the scheduling optimization on
various hardware platforms. Experiments on image classification, object detection
and instance segmentation demonstrate that the proposed framework significantly
improves the practical inference efficiency of deep networks. For example, the
average latency of a ResNet-101 on the ImageNet validation set could be reduced
by 36% and 46% on a server GPU (Nvidia Tesla-V100) and an edge device (Nvidia
Jetson TX2 GPU) respectively without sacrificing the accuracy. Code is available
at https://github.com/LeapLabTHU/LASNet.
1 Introduction
Dynamic neural networks [7] have attracted great research interest in recent years. Compared to static models [11, 17, 13, 23] which treat different inputs equally during inference, dynamic networks can allocate computation in a data-dependent manner. For example, they can conditionally skip the computation of network layers [15, 9, 32, 30] or convolutional channels [19, 1], or perform spatially adaptive inference on the most informative image regions (e.g. the foreground areas) [6, 5, 31, 35, 33, 8]. Spatial-wise dynamic networks, which typically decide whether to compute each feature pixel with masker modules [5, 31, 35, 8] (Figure 1 (a)), have shown promising results in improving the inference efficiency of convolutional neural networks (CNNs).
Despite the remarkable theoretical efficiency achieved by spatial-wise dynamic networks [5, 31, 35], researchers have found it challenging to translate the theoretical results into realistic speedup,
*Equal contribution. †Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06223v1 [cs.CV] 12 Oct 2022
[Figure 1 diagram: (a) Algorithm: spatially adaptive inference with a masker, which selects pixel locations for the dynamic convolution(s); (b) Scheduling: the practical inference pipeline of dynamic convolutions (gather → convolution(s) → scatter); (c) The general idea of our work: algorithm, scheduling, and hardware jointly influence latency, which in turn guides the design.]
Figure 1: An overview of our method. (a) illustrates the spatially adaptive inference algorithm; (b) is
the scheduling strategy; and (c) presents the three key factors to the practical latency. For a given
hardware, the latency is used to guide our algorithm design and scheduling optimization.
especially on some multi-core processors, e.g., GPUs [35, 3, 8]. The challenges are two-fold: 1) most previous approaches [5, 31, 35] perform spatially adaptive inference at the finest granularity: every pixel is flexibly decided whether to be computed or not. Such flexibility induces non-contiguous memory access [35] and requires specialized scheduling strategies (Figure 1 (b)); 2) the existing literature has only adopted the hardware-agnostic FLOPs (floating-point operations) as an inaccurate proxy for efficiency, lacking latency-aware guidance on the algorithm design. For dynamic networks, the adaptive computation with sub-optimal scheduling strategies further enlarges the discrepancy between theoretical FLOPs and practical latency. Note that previous works have validated that the latency on CPUs correlates strongly with FLOPs [8, 35]. Therefore, we mainly focus on the GPU platform in this paper, which is more challenging and less explored.
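To make the FLOPs-latency discrepancy concrete, here is a minimal sketch (our own illustration, not code from the paper) of how theoretical FLOPs are counted for a convolution and how spatially adaptive inference scales them by the activation rate r; the point above is precisely that measured latency does not shrink by the same factor.

```python
# Illustrative FLOPs accounting for a dense KxK convolution (stride 1,
# "same" padding) versus its spatially adaptive counterpart, where only
# a fraction r of output pixels is computed.

def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a dense KxK convolution."""
    return h * w * c_in * c_out * k * k

def dynamic_conv_flops(h, w, c_in, c_out, k, r):
    """Theoretical cost: only the selected fraction r of pixels is computed."""
    return conv_flops(h, w, c_in, c_out, k) * r

static_cost = conv_flops(56, 56, 64, 64, 3)              # 115,605,504 MACs
dynamic_cost = dynamic_conv_flops(56, 56, 64, 64, 3, 0.5)
print(static_cost, dynamic_cost)  # dynamic is exactly half of static
```

The theoretical cost scales linearly with r; on a GPU the gather/scatter overhead and non-contiguous memory access prevent the latency from following suit.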
We address the above challenges by proposing a latency-aware spatial-wise dynamic network
(LASNet). Three key factors to the inference latency are considered: the algorithm, the scheduling
strategy, and the hardware properties. Given a target hardware device, we directly use the latency,
rather than the FLOPs, to guide our algorithm design and scheduling optimization (Figure 1 (c)).
Because the memory access pattern and the scheduling strategies in our dynamic operators differ
from those in static networks, the libraries developed for static models (e.g. cuDNN) are sub-optimal
for the acceleration of dynamic models. Without the support of libraries, each dynamic operator
requires scheduling optimization, code optimization, compiling, and deployment for each device.
Therefore, it is laborious to evaluate the network latency on different hardware platforms. To this end,
we propose a novel latency prediction model to efficiently estimate the realistic latency of a network
by simultaneously considering the aforementioned three factors. Compared to the hardware-agnostic
FLOPs, our predicted latency can better reflect the practical efficiency of dynamic models.
Guided by this latency prediction model, we establish our latency-aware spatial-wise dynamic network (LASNet), which adaptively decides whether to allocate computation on feature patches instead of pixels [5, 31, 35] (Figure 2 top). We name this paradigm spatially adaptive inference at a coarse granularity. While less flexible than the pixel-level adaptive computation in previous works [5, 31, 35], it facilitates more contiguous memory access, benefiting realistic speedup on hardware.
The scheduling strategy and the implementation are further ameliorated for faster inference.
It is worth noting that LASNet is designed as a general framework in two aspects: 1) the coarse-grained spatially adaptive inference paradigm can be conveniently implemented in various CNN backbones, e.g., ResNets [11], DenseNets [17] and RegNets [25]; and 2) the latency predictor is an off-the-shelf tool which can be directly applied to various computing platforms (e.g. server GPUs and edge devices).
We evaluate the performance of LASNet with multiple CNN architectures on image classification, object detection, and instance segmentation tasks. Experimental results show that LASNet improves the efficiency of deep CNNs both theoretically and practically. For example, the inference latency of ResNet-101 on ImageNet [4] is reduced by 36% and 46% on an Nvidia Tesla V100 GPU and an Nvidia Jetson TX2 GPU, respectively, without sacrificing accuracy. Moreover, the proposed method outperforms various lightweight networks in the low-FLOPs regime.
Our main contributions are summarized as follows:
1. We propose LASNet, which performs coarse-grained spatially adaptive inference guided by the practical latency instead of the theoretical FLOPs. To the best of our knowledge, LASNet is the first framework that directly considers the real latency in the design phase of dynamic neural networks;
2. We propose a latency prediction model, which can efficiently and accurately estimate the latency of dynamic operators by simultaneously considering the algorithm, the scheduling strategy, and the hardware properties;
3. Experiments on image classification and downstream tasks verify that our proposed LASNet can effectively improve the practical efficiency of different CNN architectures.
2 Related works
Spatial-wise dynamic network is a common type of dynamic neural network [7]. Compared to static models, which treat different feature locations evenly during inference, these networks perform spatially adaptive inference on the most informative regions (e.g., foregrounds), and reduce the unnecessary computation on less important areas (e.g., backgrounds). Existing works mainly include three levels of dynamic computation: resolution level [36, 37], region level [33] and pixel level [5, 31, 35]. The former two generally manipulate the network inputs [33, 37] or require special architecture design [36]. In contrast, pixel-level dynamic networks can flexibly skip the convolutions on certain feature pixels in arbitrary CNN backbones [5, 31, 35]. Despite its remarkable theoretical efficiency, pixel-wise dynamic computation makes it considerably difficult to achieve realistic speedup on multi-core processors, e.g., GPUs. Compared to the previous approaches [5, 31, 35], which only focus on reducing the theoretical computation, we propose to directly use the latency to guide our algorithm design and scheduling optimization.
Hardware-aware network design. To bridge the gap between the theoretical and practical efficiency of deep models, researchers have started to consider the real latency in the network design phase. There are two lines of work in this direction. One directly performs speed tests on target devices, and summarizes guidelines to facilitate hand-designing lightweight models [23]. The other searches for fast models using neural architecture search (NAS) techniques [29, 34]. However, all existing works build static models, which have intrinsic computational redundancy as they treat different inputs in the same way. Moreover, speed tests for dynamic operators on different hardware devices can be laborious and impractical. In contrast, our proposed latency prediction model can efficiently estimate the inference latency on any given computing platform by simultaneously considering algorithm design, scheduling strategies and hardware properties.
3 Methodology
In this section, we first introduce the preliminaries of spatially adaptive inference, and then demonstrate the architecture design of our LASNet. The latency prediction model is then explained, which
guides the granularity settings and the scheduling optimization for LASNet. We further present the
implementation improvements for faster inference, followed by the training strategies.
3.1 Preliminaries
Spatially adaptive inference. Existing spatial-wise dynamic networks are usually established by attaching a masker ℳ to each convolutional block of a CNN backbone (Figure 1 (a)). Specifically, let x ∈ R^{H×W×C} denote the input of a block, where H and W are the feature height and width, and C is the number of channels. The masker takes x as input and generates a binary-valued spatial mask M = ℳ(x) ∈ {0,1}^{H×W}. Each element of M determines whether to perform convolution operations at the corresponding location of the output feature. The unselected regions are filled with the values from the input [5, 31] or obtained via interpolation [35]. We define the activation rate of a block as r = (Σ_{i,j} M_{i,j}) / (H×W), representing the ratio of the calculated pixels.
Scheduling strategy. During inference, the current scheduling strategy for spatial-wise dynamic convolutions generally involves three steps [26] (Figure 1 (b)): 1) gathering, which re-organizes the selected pixels (if the convolution kernel size is greater than 1×1, their neighbors are also required) along the batch dimension; 2) computation, which performs convolution on the gathered input; and 3) scattering, which fills the computed pixels into their corresponding locations of the output feature.
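The three-step pipeline can be sketched as follows, assuming NumPy and a 1×1 kernel (so no neighboring pixels need gathering); this is an illustrative sketch of ours, not the paper's GPU implementation. The channel count is kept unchanged so that unselected pixels can be filled with the input values, as described above.

```python
# Gather -> compute -> scatter scheduling for a dynamic 1x1 convolution.
import numpy as np

def dynamic_conv1x1(x, mask, weight):
    """x: (H, W, C), mask: (H, W) binary, weight: (C, C)."""
    out = x.copy()                    # unselected pixels keep the input values
    ys, xs = np.nonzero(mask)         # 1) gathering: indices of selected pixels
    gathered = x[ys, xs, :]           # packed along the batch dim: (N_sel, C)
    computed = gathered @ weight      # 2) computation: 1x1 conv is a matmul
    out[ys, xs, :] = computed         # 3) scattering: write results back
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
w = rng.standard_normal((8, 8))
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True                     # compute only one pixel
y = dynamic_conv1x1(x, mask, w)
```

The gather and scatter steps are exactly where the non-contiguous memory access arises, which motivates the overhead discussion below.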
[Figure 2 diagram: top, the LASNet block with a masker (pooling + Conv1x1; Gumbel noise is added during training), upsampling, and element-wise multiplication to select pixels for the Conv1x1–Conv3x3–Conv1x1 bottleneck; bottom, the optimized inference pipeline with fused gather-Conv3x3 and scatter-add operators in place of the regular convolutions.]
Figure 2: Our proposed LASNet block. Top: we first generate a low-resolution spatial mask M_coarse, which is then upsampled to obtain the mask M with the same size as the output feature. Gumbel Softmax [18, 24] is used for end-to-end training (Sec. 3.5). Bottom: the scheduling optimization is performed to decrease the memory access for faster inference (Sec. 3.4).
Limitations.
Compared to performing convolutions on the entire feature map, the aforementioned
scheduling strategy reduces the computation while bringing considerable overhead to memory access
due to the mask generation and the non-contiguous memory access. Such overhead would increase
the overall latency, especially when the granularity of dynamic convolution is at the finest pixel level.
3.2 Architecture design
Spatial granularity. As mentioned above, pixel-level dynamic convolutions [5, 31, 35] raise substantial challenges to achieving realistic speedup on multi-core processors due to non-contiguous memory access. To this end, we propose to optimize the granularity of spatially adaptive inference. Specifically, taking the commonly used bottleneck structure in [11] as an example, our coarse-grained spatial-wise dynamic convolutional block is illustrated in Figure 2. Instead of directly producing a mask with the shape of H×W, we first generate a low-resolution mask M_coarse ∈ {0,1}^{(H/S)×(W/S)}, where S is named the spatial granularity. Each element in M_coarse determines the computation of an S×S-sized feature patch. For instance, the feature size in the first ResNet stage³ is 56×56, so the possible choices for S are {1, 2, 4, 7, 8, 14, 28, 56}. The mask M_coarse is then upsampled to the size of H×W. Notably, S = 1 means that the granularity is still at the pixel level as in previous methods [5, 31, 35]. The other extreme (S = 56), where the masker directly determines whether to skip the whole block (i.e. layer skipping [30, 32]), is not considered in this paper: such an overly aggressive approach leads to a considerable drop in accuracy, as presented in Appendix C.2. The masker is composed of a pooling layer followed by a 1×1 convolution.
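The coarse mask generation and upsampling can be illustrated as follows (a NumPy sketch of ours; in LASNet the coarse mask comes from a learned pooling + 1×1 convolution masker): each entry of M_coarse is replicated over an S×S patch by nearest-neighbor upsampling, which is why S must divide the feature size.

```python
# Nearest-neighbor upsampling of a coarse (H/S, W/S) mask to (H, W),
# so that each coarse entry controls one S x S feature patch.
import numpy as np

def upsample_mask(m_coarse, s):
    """Replicate every entry of m_coarse over an s x s patch."""
    return np.repeat(np.repeat(m_coarse, s, axis=0), s, axis=1)

S = 4
m_coarse = np.array([[1, 0],
                     [0, 1]])        # one decision per 4x4 patch
M = upsample_mask(m_coarse, S)       # shape (8, 8)
print(M.shape, M.sum())              # half of the 64 pixels are selected
```

Larger S makes the selected regions contiguous blocks in memory, which is exactly what enables the realistic speedup discussed above.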
Differences to existing works. Without using the interpolation operation [35] or a carefully designed two-branch structure [8], the proposed block architecture is simple and sufficiently general to be plugged into most backbones with minimal modification. Our formulation is most similar to that in [31], which can be viewed as a variant of our method with the spatial granularity S = 1 for all blocks. Instead of performing spatially adaptive inference at the finest pixel level, our granularity S is optimized under the guidance of our latency prediction model (details in Sec. 4.2) to achieve realistic speedup on target computing platforms.
3.3 Latency prediction model
As stated before, it is laborious to evaluate the latency of dynamic operators on different hardware
platforms. To efficiently seek preferable granularity settings on arbitrary hardware devices, we
propose a latency prediction model
G
, which can directly predict the delay of executing dynamic
operators on any target devices. For a spatial-wise dynamic convolutional block, the latency predictor
G
takes the hardware properties
H
, the layer parameters
P
, the spatial granularity
S
, and the activation
rate ras input and predicts the latency `of a dynamic convolutional block: `=G(H,P, S, r).
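The predictor's interface can be sketched as below. The cost model inside is a deliberately crude roofline-style stand-in of ours (the maximum of compute time and memory time, with a gather/scatter overhead term that shrinks as the granularity S grows); the paper's actual model G explicitly models processing engines and the memory hierarchy, as described next.

```python
# Hypothetical sketch of the interface l = G(H, P, S, r). All names and
# the cost model are illustrative assumptions, not the paper's predictor.
from dataclasses import dataclass

@dataclass
class Hardware:
    peak_flops: float      # FLOPs per second
    mem_bandwidth: float   # bytes per second

@dataclass
class Layer:
    h: int
    w: int
    c_in: int
    c_out: int
    k: int
    bytes_per_elem: int = 4

def predict_latency(hw, layer, s, r):
    """Rough estimate: max of compute time and memory time; the memory
    overhead factor crudely shrinks as granularity s grows (more
    contiguous access)."""
    flops = layer.h * layer.w * layer.c_in * layer.c_out * layer.k ** 2 * 2 * r
    traffic = (layer.h * layer.w * (layer.c_in + layer.c_out)
               * layer.bytes_per_elem) * (1.0 + 1.0 / s)
    return max(flops / hw.peak_flops, traffic / hw.mem_bandwidth)

hw = Hardware(peak_flops=14e12, mem_bandwidth=900e9)   # V100-like numbers
layer = Layer(h=56, w=56, c_in=64, c_out=64, k=3)
l_fine = predict_latency(hw, layer, s=1, r=0.5)
l_coarse = predict_latency(hw, layer, s=8, r=0.5)
```

Even this toy model captures the qualitative behavior that motivates the design: at a fixed activation rate, a coarser granularity is never predicted slower, because only the memory term depends on S.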
Hardware modeling.
We model a hardware device as multiple processing engines (PEs), and parallel
computation can be executed on these PEs. As shown in Figure 3, we model the memory system as a
³Here we refer to a stage as the cascading of multiple blocks which process features with the same resolution.