Structural Pruning via Latency-Saliency Knapsack
Maying Shen*, Hongxu Yin*, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez
NVIDIA
{mshen,dannyy,pmolchanov,lmao,jiannal,josea}@nvidia.com
Abstract
Structural pruning can simplify network architecture and improve inference speed. We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget on the target device. For filter importance ranking, HALP leverages a latency lookup table to track latency reduction potential and a global saliency score to gauge accuracy drop. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning as a reward maximization problem under the target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and the accuracy-efficiency trade-off. We examine HALP on both classification and detection tasks, over varying networks, on the ImageNet and VOC datasets, and on different platforms. In particular, for ResNet-50/-101 pruning on ImageNet, HALP improves network throughput by 1.60×/1.90× with +0.3%/−0.2% top-1 accuracy changes, respectively. For SSD pruning on VOC, HALP improves throughput by 1.94× with only a 0.56 mAP drop. HALP consistently outperforms prior art, sometimes by large margins. Project page at https://halp-neurips.github.io/.
1 Introduction
Convolutional Neural Networks (CNNs) act as the central tenet behind the rapid development of computer vision tasks such as classification, detection, segmentation, and image synthesis, among others. As performance improves, so do model size, computation, and latency. With millions, sometimes billions, of parameters (e.g., GPT-3 [4]), modern neural networks face increasing challenges upon ubiquitous deployment, which mostly comes with stringent constraints such as energy and latency [9, 46, 45, 55]. In certain cases like autonomous driving, a breach of the real-time constraint not only undermines user experience, but also imposes critical safety concerns. Even for cloud services, speeding up inference directly translates into higher throughput, allowing more clients and users to benefit from the service.
One effective and efficient method to reduce model complexity is network pruning. The primary goal of pruning is to remove the parameters, along with their computation, that are deemed least important for inference [2, 20, 37, 45]. Compatible with other streams of compression work such as quantization [5, 64, 73], dynamic inference [31, 67, 70], and distillation [23, 47, 69], pruning enables a flexible tuning of model complexity towards varying constraints, while requiring much less design effort than neural architecture search [58, 61, 65] and architecture re-designs [24, 42, 59]. Thus, in this work, we study pruning, in particular structured pruning that reduces channels to benefit off-the-shelf platforms, e.g., GPUs.
*Equal contribution.
Figure 1: The proposed hardware-aware latency pruning (HALP) paradigm. Considering both performance and latency contributions, HALP formulates global structural pruning as a global resource allocation problem (Section 3.1), solvable using our augmented knapsack algorithm (Section 3.2). Pruned architectures surpass prior work across varying latency constraints given changing network architectures for both classification and detection tasks (Section 4).
[Figure 2 plot: Top-1 accuracy (%) versus inference time (ms/image) and FLOPs (G), comparing the original model, EagleEye (Li et al., ECCV'20), AutoSlim (Yu et al., NeurIPS'19), r× RN50 (He et al., CVPR'16), MetaPruning (Liu et al., ICCV'19), GReg (Wang et al., ICLR'21), CHIP (Sui et al., NeurIPS'21), GBN (You et al., NeurIPS'19), QCQP (Jeong et al., AISTATS'22), HALP (ours), and HALP + distillation (ours).]
Figure 2: Pruning ResNet50 on the ImageNet dataset. The proposed HALP surpasses state-of-the-art structured pruning methods over accuracy, latency, and FLOPs metrics. Target hardware is NVIDIA Titan V GPU. Top-left is better.
As the pruning literature develops, the pruning criteria also evolve to better reflect final efficiency. The early phase of the field focuses on maximum parameter removal in search of minimal representations of the pretrained model. This leads to a flourish of approaches that rank neurons effectively to measure their importance [46, 62]. As each neuron/filter entails intrinsically different computation, follow-up works explore proxies to enhance redundancy removal, FLOPs being one of the most widely adopted metrics [32, 66, 72] to reflect the number of multiplications and additions the model requires. However, for models with very similar FLOPs, latency can vary significantly [58].
Recently, more and more works have started to target latency reduction directly [6, 68, 55]. However, little has been done on GPU-friendly pruning methods due to the non-trivial latency-architecture trade-off. For example, as recently observed in [51], GPUs usually impose staircase-shaped latency patterns for convolutional operators with varying channel counts, which inevitably arise as the pruning rate varies; see the latency surface in Fig. 1. This imposes a constraint that pruning needs to be done in groups to achieve latency improvement. Moreover, an exact look-up table of layer latencies under different pruning configurations helps maximize performance while minimizing latency.
Pruning different layers of a deep neural network results in different accuracy-latency trade-offs. Typically, removing channels from the later layers has a smaller impact on both accuracy and latency than removing channels from the early layers. We ask whether it is better to remove more neurons from later layers or fewer from early layers to achieve the same accuracy-latency trade-off. By nature, the problem is combinatorial and requires an appropriate solution.
In this paper, we propose hardware-aware latency pruning (HALP) that formulates pruning as a resource allocation optimization problem to maximize the accuracy while maintaining a latency budget on the target device. The overall workflow is shown in Fig. 1. To estimate the latency of each pruned architecture, we pre-analyze the operator-level latency values by creating a look-up table for every layer of the model on the target hardware. Then we introduce an additional score for each neuron group to reflect and encourage latency reduction. To this end, we first rank the neurons according to their importance estimates, and then dynamically adjust their latency contributions. With neurons re-calibrated towards the hardware-aware latency curve, we then select the remaining neurons to maximize the gradient-based importance estimates for accuracy, within the total latency constraint. This makes the entire neuron ranking solvable under the knapsack paradigm. To enforce that the neuron selection order within a layer goes from the most important to the least, we enhance the knapsack solver so that the calculated latency contributions of the remaining neurons hold. HALP surpasses prior art in pruning efficacy; see Fig. 2 and the more detailed analysis in Section 4. Our main contributions are summarized as follows:
- We propose a latency-driven structured pruning algorithm that exploits hardware latency traits to yield direct inference speedups.
- We orient the pruning process around a quick yet highly effective knapsack scheme that seeks a combination of remaining neuron groups maximizing importance while respecting the target latency constraint.
- We introduce a group size adjustment scheme for the knapsack solver amid varying latency contributions across layers, hence allowing full exploitation of the latency landscape of the underlying hardware.
- We compare to prior art when pruning ResNet, MobileNet, and VGG architectures on ImageNet, CIFAR10, and PASCAL VOC, and demonstrate that our method yields consistent latency and accuracy improvements over state-of-the-art methods. Our ImageNet pruning results present a viable 1.6× to 1.9× speedup while preserving very similar original accuracy of the ResNets.
2 Related work
Pruning methods. Depending on when pruning is performed, current methods can generally be divided into three groups [14]: i) pruning pretrained models [20, 40, 33, 22, 46, 45, 16], ii) pruning at initialization [15, 30, 10], and iii) pruning during training [2, 17, 41]. Despite notable progress in the latter two streams, pruning pretrained models remains the most popular paradigm, with structural sparsity favored by off-the-shelf inference platforms such as GPUs.
To improve inference efficiency, many early pruning methods trim down the neural network aiming to achieve a high channel pruning ratio while maintaining acceptable accuracy. The estimation of neuron importance has been widely studied in the literature [25, 40, 46]. For example, [45] proposes to use a Taylor expansion to measure the importance of neurons and prunes a desired number of the least-ranked neurons. However, a channel pruning ratio does not directly translate into a computation reduction ratio, since a neuron at a different location leads to different computations.
There are recent methods that focus primarily on reducing FLOPs. Some of them take FLOPs into consideration when calculating neuron importance to penalize neurons that induce high computation [66]. An alternative line of work proposes to select the best pruned network from a set of candidates [32, 68]. However, candidate selection takes a long time due to the large number of candidates. In addition, these methods use FLOPs as a proxy for latency, which is often inaccurate as networks with similar FLOPs might have significantly different latencies [58].
Latency-aware compression. Emerging compression techniques shift attention to directly cutting down latency. One popular stream is Neural Architecture Search (NAS) methods [9, 12, 58, 65] that adaptively adjust the architecture of the network for a given latency requirement. They incorporate the platform constraints into the optimization process in both the architecture and parameter space to jointly optimize the model size and accuracy. Despite remarkable insights, NAS methods remain computationally expensive in general compared to their pruning counterparts.
Latency-oriented pruning has also gained a growing amount of attention. [6] presents a framework for network compression under operational constraints, using Bayesian optimization to iteratively obtain compression hyperparameters that satisfy the constraints. Along the same line, NetAdapt [68] iteratively prunes neurons across layers under the guidance of empirical latency measurements on the target platform. While these methods push the frontier of latency-constrained pruning, the hardware-incurred latency surface in fact offers much more potential under our enhanced pruning policy: as we show later, large room for improvement remains unexploited and realizable.
3 Method
In this section, we first formulate pruning as an optimization problem, before diving into the importance estimation for accuracy and latency. Then, we elaborate on how to solve the optimization via the knapsack regime, augmented by dynamic grouping of neurons. We finalize the method by combining these key steps into the single framework of HALP.
3.1 Objective Function
Consider a neural network that consists of $L$ layers performing linear operations on their inputs, together with non-linear activation layers and potentially pooling layers. Suppose there are $N_l$ neurons (output channels) in the $l$-th layer and each neuron is encoded by parameters $\mathbf{W}_l^n \in \mathbb{R}^{C_l^{in} \times K_l \times K_l}$, where $C^{in}$ is the number of input channels and $K$ is the kernel size. By putting all the neurons across the network together, we get the neuron parameter set $\mathcal{W} = \{\{\mathbf{W}_l^n\}_{n=1}^{N_l}\}_{l=1}^{L}$, where $N = \sum_{l=1}^{L} N_l$ is the total number of neurons in the network.
Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{M}$, the problem of network pruning with a given constraint $C$ can be generally formulated as the following optimization problem:
$$\arg\min_{\hat{\mathcal{W}}} \; \mathcal{L}(\hat{\mathcal{W}}, \mathcal{D}) \quad \text{s.t.} \quad \Phi\big(f(\hat{\mathcal{W}}, x_i)\big) \leq C, \tag{1}$$
where $\hat{\mathcal{W}} \subset \mathcal{W}$ is the remaining parameters after pruning and $\mathcal{L}$ is the loss of the task. $f(\cdot)$ encodes the network function, and $\Phi(\cdot)$ maps the network to the constraint $C$, such as latency, FLOPs, or memory. We primarily focus on latency in this work while the method easily scales to other constraints.
The key to solving the aforementioned problem relies on identifying the portion of the network that satisfies the constraint while incurring minimum performance disruption:
$$\arg\max_{p_1, \dots, p_L} \; \sum_{l=1}^{L} \mathcal{I}_l(p_l) \quad \text{s.t.} \quad \sum_{l=1}^{L} T_l(p_{l-1}, p_l) \leq C, \quad \forall l \;\; 0 \leq p_l \leq N_l, \tag{2}$$
where $p_l$ denotes the number of kept neurons at layer $l$, $\mathcal{I}_l(p_l)$ signals the maximum importance to the final accuracy with $p_l$ neurons, and $T_l(p_{l-1}, p_l)$ checks on the associated latency contribution of layer $l$ with $p_{l-1}$ input channels and $p_l$ output channels. $p_0$ denotes a fixed input channel number for the first convolutional block, e.g., $3$ for RGB images. We next elaborate on $\mathcal{I}(\cdot)$ and $T(\cdot)$ in detail.
Importance score. To get the importance score of a layer to final accuracy, namely $\mathcal{I}_l(p_l)$ in Eq. 2, we take it as the accumulated score from individual neurons $\sum_{j=1}^{p_l} \mathcal{I}_l^j$. We first approximate the importance of neurons using the Taylor expansion of the loss change [45]. Specifically, we prune on batch normalization layers and the importance of the $n$-th neuron in the $l$-th layer is calculated as
$$\mathcal{I}_l^n = \left| g_{\gamma_l^n}\, \gamma_l^n + g_{\beta_l^n}\, \beta_l^n \right|, \tag{3}$$
where $g$ denotes the gradient of the weight, $\gamma_l^n$ and $\beta_l^n$ are the corresponding weight and bias from the batch normalization layer, respectively. Unlike a squared loss in [45], we use absolute difference as we observe slight improvements.
In order to maximize the total importance, we keep the most important neurons at a higher priority. To this end, we rank the neurons in the $l$-th layer according to their importance score in a descending order and denote the importance score of the $j$-th-ranked neuron as $\mathcal{I}_l^j$, thus we have
$$\mathcal{I}_l(p_l) = \sum_{j=1}^{p_l} \mathcal{I}_l^j, \quad 0 \leq p_l \leq N_l, \quad \mathcal{I}_l^1 \geq \cdots \geq \mathcal{I}_l^{N_l}. \tag{4}$$
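To make Eq. 3 concrete, a minimal PyTorch sketch of the per-neuron score is given below. It assumes scoring on `BatchNorm2d` parameters after a standard backward pass; the function name and the choice to accumulate over a few mini-batches are illustrative rather than prescribed by the paper.

```python
import torch
import torch.nn as nn

def taylor_importance(model: nn.Module) -> dict:
    """Eq. 3: I_l^n = |g_gamma * gamma + g_beta * beta| per BatchNorm channel.

    Call after loss.backward() so that .grad fields are populated.
    Returns {bn_layer_name: 1-D tensor of length N_l}.
    """
    scores = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            scores[name] = (m.weight.grad * m.weight
                            + m.bias.grad * m.bias).abs().detach()
    return scores

# Typical use: average the scores over a few mini-batches before ranking.
# for images, labels in loader:
#     model.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     batch_scores = taylor_importance(model)  # accumulate / average these
```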
Latency contribution. We empirically obtain the layer latency $T_l(p_{l-1}, p_l)$ in Eq. 2 by pre-building a layer-wise look-up table with pre-measured latencies. This layer latency corresponds to the aggregation of the latency contribution $c_l^j$ of each neuron in the layer:
$$T_l(p_{l-1}, p_l) = \sum_{j=1}^{p_l} c_l^j, \quad 0 \leq p_l \leq N_l. \tag{5}$$
The latency contribution of the $j$-th neuron in the $l$-th layer can also be computed using the entries in the look-up table as:
$$c_l^j = T_l(p_{l-1}, j) - T_l(p_{l-1}, j-1), \quad 1 \leq j \leq p_l. \tag{6}$$
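As an illustration of how such a look-up table can be pre-built, the sketch below times a single convolution for every (input, output) channel pair and then differentiates neighboring entries to recover the per-neuron contributions of Eq. 6. Function names, the millisecond unit, and the timing loop are assumptions for illustration; the actual table in HALP is measured per layer configuration on the target hardware, typically at the channel-group granularity of Section 3.3 rather than per single channel.

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def build_latency_lut(c_in_max, c_out_max, k, stride, feat, device="cuda", iters=50):
    """lut[(p_prev, p)] ~ T_l(p_prev, p): measured latency (ms) of a conv layer
    with p_prev input and p output channels on the given feature-map size."""
    lut = {}
    x_full = torch.randn(1, c_in_max, feat, feat, device=device)
    for p_prev in range(1, c_in_max + 1):
        x = x_full[:, :p_prev].contiguous()
        for p in range(1, c_out_max + 1):
            conv = nn.Conv2d(p_prev, p, k, stride=stride, padding=k // 2,
                             bias=False).to(device)
            for _ in range(10):          # warm-up to stabilize clocks/caches
                conv(x)
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(iters):
                conv(x)
            torch.cuda.synchronize()
            lut[(p_prev, p)] = (time.perf_counter() - t0) / iters * 1e3
    return lut

def neuron_latency_contributions(lut, p_prev, n_out):
    """Eq. 6: c_l^j = T_l(p_prev, j) - T_l(p_prev, j-1), with T_l(p_prev, 0) = 0."""
    return [lut[(p_prev, j)] - (lut[(p_prev, j - 1)] if j > 1 else 0.0)
            for j in range(1, n_out + 1)]
```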
Algorithm 1 Augmented Knapsack Solver
Input: Importance scores $\{\mathcal{I}_l \in \mathbb{R}^{N_l}\}_{l=1}^{L}$, each $\mathcal{I}_l$ sorted in descending order; neuron latency contributions $\{c_l \in \mathbb{R}^{N_l}\}_{l=1}^{L}$; latency constraint $C$.
1: maxV ∈ R^(C+1), keep ∈ R^(L×(C+1))    ▷ maxV[c]: max importance under constraint c; keep[l, c]: # neurons to keep in layer l to achieve maxV[c]
2: for l = 1, ..., L do
3:   for j = 1, ..., N_l do
4:     for c = 1, ..., C do
5:       v_keep = I_l^j + maxV[c − c_l^j], v_prune = maxV[c]    ▷ total importance achievable under constraint c with neuron j kept or pruned
6:       if v_keep > v_prune and keep[l, c − c_l^j] == j − 1 then    ▷ check if keeping leads to a higher score and all more important neurons in the layer are kept
7:         keep[l, c] = j, update_maxV[c] = v_keep
8:       else
9:         keep[l, c] = keep[l, c − 1], update_maxV[c] = v_prune
10:      end if
11:    end for
12:    maxV ← update_maxV
13:  end for
14: end for
15:
16: keep_n = ∅    ▷ to save the kept neurons in the model
17: for l = L, ..., 1 do    ▷ retrieve the set of kept neurons
18:   p_l = keep[l, C]
19:   keep_n ← keep_n ∪ {p_l top-ranked neurons in layer l}
20:   C ← C − Σ_{j=1}^{p_l} c_l^j
21: end for
Output: Kept important neurons (keep_n).
In practice, we first rank neurons globally by importance and then consider their latency contribution. Thus, we can draw the following properties. If we remove the least important neuron in layer $l$, then the number of neurons changes from $p_l$ to $p_l - 1$, leading to a latency reduction of $c_l^{p_l}$ as this neuron's latency contribution score. We assign the potential latency reduction to neurons in the layer by the importance order. The most important neuron in that layer always has a latency contribution of $c_l^1$. At this stage, finding the right combination of neurons to keep imposes a combinatorial problem, and in the next section we tackle it via reward maximization considering latency and accuracy traits.
3.2 Augmented Knapsack Solver
Given both importance and latency estimates, we now aim at solving Eq. 2. By plugging back in the layer importance of Eq. 4 and the layer latency of Eq. 5, we come to
$$\max \; \sum_{l=1}^{L} \sum_{j=1}^{p_l} \mathcal{I}_l^j \quad \text{s.t.} \quad \sum_{l=1}^{L} \sum_{j=1}^{p_l} c_l^j \leq C, \quad 0 \leq p_l \leq N_l, \quad \mathcal{I}_l^1 \geq \mathcal{I}_l^2 \geq \cdots \geq \mathcal{I}_l^{N_l}. \tag{7}$$
This simplifies the overall pruning process into a knapsack problem, only with additional precedence constraints. The precedence enforcement originates from the fact that, for a neuron with rank $j$ in the $l$-th layer, the neuron latency contribution only holds when all the neurons with rank $r = 1, \dots, j-1$ are kept in the $l$-th layer and the rest of the neurons with rank $r = j+1, j+2, \dots, N_l$ are removed. Yet the problem is solvable by specifying, for each item, the list of preceding items that need to be selected before its inclusion.

We augment the knapsack solver to consider the reordered neurons with descending importance score so that all the preceding neurons of an item are processed before it. Pseudo code of the augmented knapsack solver is provided in Algo. 1 (a detailed explanation is provided in Appendix A). The augmented solver is required to ensure that the latency cost is accounted for correctly.
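For concreteness, a compact Python sketch of the dynamic program of Algo. 1 is given below. It assumes latency contributions have been quantized to integers (e.g., microseconds) so the budget can index a DP table, and it snapshots the state before each neuron so the precedence check reads the solution over previously processed items; these are implementation choices for illustration, not requirements stated in the paper.

```python
def augmented_knapsack(importance, cost, budget):
    """Maximize total importance under an integer latency budget, keeping the
    j-th ranked neuron of a layer only if all higher-ranked neurons are kept.

    importance[l][j] : score of the j-th ranked neuron of layer l (descending).
    cost[l][j]       : integer latency contribution c_l^j of that neuron.
    Returns keep_counts[l] = number of top-ranked neurons kept in layer l.
    """
    L = len(importance)
    best = [0.0] * (budget + 1)                    # best[c]: max importance within budget c
    keep = [[0] * (budget + 1) for _ in range(L)]  # keep[l][c]: kept neurons of layer l

    for l in range(L):
        for j in range(len(importance[l])):
            w, v = cost[l][j], importance[l][j]
            prev_best = best[:]                    # state before considering neuron j
            prev_keep = keep[l][:]
            for c in range(w, budget + 1):
                # precedence: only extend solutions already keeping the top j neurons of layer l
                if prev_keep[c - w] == j and prev_best[c - w] + v > best[c]:
                    best[c] = prev_best[c - w] + v
                    keep[l][c] = j + 1

    # Trace back layer by layer, releasing each layer's spent latency (Algo. 1, lines 17-21).
    keep_counts, c = [0] * L, budget
    for l in range(L - 1, -1, -1):
        p = keep[l][c]
        keep_counts[l] = p
        c -= sum(cost[l][:p])
    return keep_counts


# Toy usage: two layers, per-neuron importance/cost, budget of 6 latency units.
imp = [[5.0, 3.0, 1.0], [4.0, 2.0]]
lat = [[2, 2, 2], [1, 1]]
print(augmented_knapsack(imp, lat, 6))  # -> [2, 2]: keep the top-2 neurons in each layer
```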
3.3 Neuron Grouping
Considering each neuron individually results in burdensome computation during pruning. We next explore grouping neurons so that a number of them can be jointly considered and removed, enabling faster pruning [69]. Neuron grouping helps exploit the hardware-incurred channel granularity guided by the latency, speeds up the knapsack solving of Eq. 7, and yields structures that maximize GPU utilization, keeping as many parameters as possible under similar latency. In addition, neuron grouping simplifies the knapsack problem, which scales linearly with the number of candidates under consideration (see Line 3 of Alg. 1), thus speeding up the solver as we discuss later.

We refer to the difference in neuron counts between two latency cliffs of the staircase-patterned latency as the latency step size. In our method, we group $s$ channels in a layer as an entirety, where the value of $s$ is equal to the latency step size. The neurons are grouped by the order of importance.
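A minimal sketch of this grouping step is shown below (names are illustrative): consecutive channels in the importance-sorted order are merged into groups of the latency step size $s$, and each group's importance and latency contribution are the sums over its members, so that groups, rather than individual neurons, become the items fed to the solver of Algo. 1.

```python
def group_neurons(importance, latency_contrib, step_size):
    """Merge the importance-sorted neurons of one layer into groups of size
    `step_size` (the latency step size). Each group becomes a single knapsack
    item whose score and cost are the sums over its member neurons."""
    groups = []
    for start in range(0, len(importance), step_size):
        end = min(start + step_size, len(importance))
        groups.append({
            "ranks": list(range(start, end)),           # neuron ranks in this group
            "importance": sum(importance[start:end]),   # grouped score for Eq. 7
            "latency": sum(latency_contrib[start:end]), # grouped cost for Eq. 7
        })
    return groups

# Example: with a latency step size of 32, a 128-channel layer yields 4 groups.
# layer_groups = group_neurons(sorted_scores, contribs, step_size=32)
```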