Structural Pruning via Latency-Saliency Knapsack
Maying Shen*, Hongxu Yin*, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez
NVIDIA
{mshen,dannyy,pmolchanov,lmao,jiannal,josea}@nvidia.com
Abstract
Structural pruning can simplify network architecture and improve inference speed. We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget on the target device. For filter importance ranking, HALP leverages a latency lookup table to track latency reduction potential and a global saliency score to gauge accuracy drop. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning as a reward maximization problem under the target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and the accuracy-efficiency trade-off. We examine HALP on both classification and detection tasks, over varying networks, on the ImageNet and VOC datasets, and on different platforms. In particular, for ResNet-50/-101 pruning on ImageNet, HALP improves network throughput by 1.60×/1.90× with +0.3%/−0.2% top-1 accuracy changes, respectively. For SSD pruning on VOC, HALP improves throughput by 1.94× with only a 0.56 mAP drop. HALP consistently outperforms prior art, sometimes by large margins. Project page at https://halp-neurips.github.io/.
1 Introduction
Convolutional Neural Networks (CNNs) act as the central tenet behind the rapid development of computer vision tasks such as classification, detection, segmentation, and image synthesis, among others. As performance improves, so do model size, computation, and latency. With millions, sometimes billions, of parameters (e.g., GPT-3 [4]), modern neural networks face increasing challenges upon ubiquitous deployment, which mostly comes with stringent constraints such as energy and latency [9, 46, 45, 55]. In certain cases like autonomous driving, a breach of the real-time constraint not only undermines user experience, but also imposes critical safety concerns. Even for cloud services, speeding up inference directly translates into higher throughput, allowing more clients and users to benefit from the service.
One effective and efficient method to reduce model complexity is network pruning. The primary goal of pruning is to remove the parameters, along with their computation, that are deemed least important for inference [2, 20, 37, 45]. Compatible with other streams of compression work such as quantization [5, 64, 73], dynamic inference [31, 67, 70], and distillation [23, 47, 69], pruning enables a flexible tuning of model complexity towards varying constraints, while requiring much less design effort than neural architecture search [58, 61, 65] and architecture re-designs [24, 42, 59]. Thus, in this work, we study pruning, in particular structured pruning that reduces channels to benefit off-the-shelf platforms, e.g., GPUs.
*Equal contribution.
Figure 1: The proposed hardware-aware latency pruning (HALP) paradigm. Considering both performance and latency contributions, HALP formulates global structural pruning as a global resource allocation problem (Section 3.1), solvable using our augmented knapsack algorithm (Section 3.2). Pruned architectures surpass prior work across varying latency constraints given changing network architectures for both classification and detection tasks (Section 4).
[Figure 2 plot: Top-1 accuracy (%) versus inference time (ms/image) and FLOPs (G), comparing the original model, EagleEye (Li et al., ECCV'20), AutoSlim (Yu et al., NeurIPS'19), r× RN50 (He et al., CVPR'16), MetaPruning (Liu et al., ICCV'19), GReg (Wang et al., ICLR'21), CHIP (Sui et al., NeurIPS'21), GBN (You et al., NeurIPS'19), QCQP (Jeong et al., AISTATS'22), HALP (ours), and HALP + distillation (ours).]
Figure 2: Pruning ResNet50 on the ImageNet dataset. The proposed HALP surpasses state-of-the-art structured pruning methods over accuracy, latency, and FLOPs metrics. Target hardware is NVIDIA Titan V GPU. Top-left is better.
As the pruning literature develops, the pruning criteria also evolve to better reflect final efficiency. The early phase of the field focuses on maximum parameter removal in search of minimal representations of the pretrained model. This leads to a flourish of approaches that rank neurons effectively to measure their importance [46, 62]. As each neuron/filter entails intrinsically different computation, follow-up works explore proxies to enhance redundancy removal, FLOPs being one of the most widely adopted metrics [32, 66, 72] to reflect the number of multiplications and additions the model requires. However, for models with very similar FLOPs, latency can vary significantly [58].
Recently, more and more works have started to target latency reduction directly [6, 68, 55]. However, little has been done on GPU-friendly pruning methods due to the non-trivial latency-architecture trade-off. For example, as recently observed in [51], GPUs usually impose staircase-shaped latency patterns for convolutional operators with varying channel counts, which inevitably arise as the pruning rate varies; see the latency surface in Fig. 1. This imposes a constraint that pruning needs to be done in groups to achieve latency improvement. Moreover, an exact look-up table of layer latencies under different pruning configurations helps maximize performance while minimizing latency.
Pruning different layers of a deep neural network results in different accuracy-latency trade-offs. Typically, removing channels from the later layers has a smaller impact on both accuracy and latency than removing channels from the early layers. We ask whether it is better to remove more neurons from later layers or fewer from early layers to achieve the same accuracy-latency trade-off. By nature, the problem is combinatorial and requires an appropriate solution.
In this paper, we propose hardware-aware latency pruning (HALP) that formulates pruning as a resource allocation optimization problem to maximize the accuracy while maintaining a latency budget on the target device. The overall workflow is shown in Fig. 1. To estimate the latency of each pruned architecture, we pre-analyze the operator-level latency values by creating a look-up table for every layer of the model on the target hardware. Then we introduce an additional score for each neuron group to reflect and encourage latency reduction. To this end, we first rank the neurons according to their importance estimates, and then dynamically adjust their latency contributions. With neurons re-calibrated towards the hardware-aware latency curve, we then select the remaining neurons to maximize the gradient-based importance estimates for accuracy, within the total latency constraint. This makes the entire neuron ranking solvable under the knapsack paradigm. To enforce that the neuron selection order within a layer goes from the most important to the least, we enhance the knapsack solver so that the calculated latency contributions of the remaining neurons hold. HALP surpasses prior art in pruning efficacy; see Fig. 2 and the more detailed analysis in Section 4. Our main contributions are summarized as follows:
- We propose a latency-driven structured pruning algorithm that exploits hardware latency traits to yield direct inference speedups.
- We orient the pruning process around a quick yet highly effective knapsack scheme that seeks a combination of remaining neuron groups maximizing importance while respecting the target latency constraint.
- We introduce a group size adjustment scheme for the knapsack solver amid varying latency contributions across layers, hence allowing full exploitation of the latency landscape of the underlying hardware.
- We compare to prior art when pruning ResNet, MobileNet, and VGG architectures on ImageNet, CIFAR10, and PASCAL VOC, and demonstrate that our method yields consistent latency and accuracy improvements over state-of-the-art methods. Our ImageNet pruning results present a viable 1.6× to 1.9× speedup while preserving very similar original accuracy of the ResNets.
2 Related work
Pruning methods. Depending on when pruning is performed, current methods can generally be divided into three groups [14]: i) pruning pretrained models [20, 40, 33, 22, 46, 45, 16], ii) pruning at initialization [15, 30, 10], and iii) pruning during training [2, 17, 41]. Despite notable progress in the latter two streams, pruning pretrained models remains the most popular paradigm, with structural sparsity favored by off-the-shelf inference platforms such as GPUs.
To improve inference efficiency, many early pruning methods trim down the neural network aiming to achieve a high channel pruning ratio while maintaining acceptable accuracy. The estimation of neuron importance has been widely studied in the literature [25, 40, 46]. For example, [45] proposes to use a Taylor expansion to measure the importance of neurons and prunes a desired number of the least-ranked neurons. However, a channel pruning ratio does not directly translate into a computation reduction ratio, since a neuron at a different location leads to different computations.
There are recent methods that focus primarily on reducing FLOPs. Some of them take FLOPs into consideration when calculating neuron importance to penalize neurons that induce high computation [66]. An alternative line of work proposes to select the best pruned network from a set of candidates [32, 68]. However, candidate selection takes a long time due to the large number of candidates. In addition, these methods use FLOPs as a proxy for latency, which is often inaccurate as networks with similar FLOPs might have significantly different latencies [58].
Latency-aware compression. Emerging compression techniques shift attention to directly cutting down latency. One popular stream is Neural Architecture Search (NAS) methods [9, 12, 58, 65] that adaptively adjust the architecture of the network for a given latency requirement. They incorporate the platform constraints into the optimization process in both the architecture and parameter space to jointly optimize the model size and accuracy. Despite remarkable insights, NAS methods remain computationally expensive in general compared to their pruning counterparts.
Latency-oriented pruning has also gained a growing amount of attention. [6] presents a framework for network compression under operational constraints, using Bayesian optimization to iteratively obtain compression hyperparameters that satisfy the constraints. Along the same line, NetAdapt [68] iteratively prunes neurons across layers under the guidance of empirical latency measurements on the target platform. While these methods push the frontier of latency-constrained pruning, the hardware-incurred latency surface in fact offers much more potential under our enhanced pruning policy: as we show later, large room for improvement remains unexploited and realizable.
3 Method
In this section, we first formulate pruning as an optimization problem, before diving into the importance estimation for accuracy and latency. Then, we elaborate on how to solve the optimization via the knapsack regime, augmented by dynamic grouping of neurons. We finalize the method by combining these key steps into the single framework of HALP.
3.1 Objective Function
Consider a neural network that consists of $L$ layers performing linear operations on their inputs, together with non-linear activation layers and potentially pooling layers. Suppose there are $N_l$ neurons (output channels) in the $l$-th layer and each neuron is encoded by parameters $\mathbf{W}_l^n \in \mathbb{R}^{C_l^{in} \times K_l \times K_l}$, where $C^{in}$ is the number of input channels and $K$ is the kernel size. By putting all the neurons across the network together, we get the neuron parameter set $\mathcal{W} = \{\{\mathbf{W}_l^n\}_{n=1}^{N_l}\}_{l=1}^{L}$, where $N = \sum_{l=1}^{L} N_l$ is the total number of neurons in the network.
Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{M}$, the problem of network pruning with a given constraint $C$ can be generally formulated as the following optimization problem:
$$\arg\min_{\hat{\mathcal{W}}} \; \mathcal{L}(\hat{\mathcal{W}}, \mathcal{D}) \quad \text{s.t.} \quad \Phi\big(f(\hat{\mathcal{W}}, x_i)\big) \leq C, \tag{1}$$
where $\hat{\mathcal{W}} \subset \mathcal{W}$ is the remaining parameters after pruning and $\mathcal{L}$ is the loss of the task. $f(\cdot)$ encodes the network function, and $\Phi(\cdot)$ maps the network to the constraint $C$, such as latency, FLOPs, or memory. We primarily focus on latency in this work while the method easily scales to other constraints.
The key to solving the aforementioned problem relies on identifying the portion of the network that satisfies the constraint while incurring minimum performance disruption:
$$\arg\max_{p_1, \dots, p_L} \; \sum_{l=1}^{L} \mathcal{I}_l(p_l) \quad \text{s.t.} \quad \sum_{l=1}^{L} T_l(p_{l-1}, p_l) \leq C, \quad \forall l \;\; 0 \leq p_l \leq N_l, \tag{2}$$
where $p_l$ denotes the number of kept neurons at layer $l$, $\mathcal{I}_l(p_l)$ signals the maximum importance to the final accuracy with $p_l$ neurons, and $T_l(p_{l-1}, p_l)$ checks on the associated latency contribution of layer $l$ with $p_{l-1}$ input channels and $p_l$ output channels. $p_0$ denotes a fixed input channel number for the first convolutional block, e.g., $3$ for RGB images. We next elaborate on $\mathcal{I}(\cdot)$ and $T(\cdot)$ in detail.
Importance score. To get the importance score of a layer to final accuracy, namely $\mathcal{I}_l(p_l)$ in Eq. 2, we take it as the accumulated score from individual neurons $\sum_{j=1}^{p_l} \mathcal{I}_l^j$. We first approximate the importance of neurons using the Taylor expansion of the loss change [45]. Specifically, we prune on batch normalization layers and the importance of the $n$-th neuron in the $l$-th layer is calculated as
$$\mathcal{I}_l^n = \left| g_{\gamma_l^n}\, \gamma_l^n + g_{\beta_l^n}\, \beta_l^n \right|, \tag{3}$$
where $g$ denotes the gradient of the weight, $\gamma_l^n$ and $\beta_l^n$ are the corresponding weight and bias from the batch normalization layer, respectively. Unlike a squared loss in [45], we use absolute difference as we observe slight improvements.
In order to maximize the total importance, we keep the most important neurons at a higher priority. To this end, we rank the neurons in the $l$-th layer according to their importance score in a descending order and denote the importance score of the $j$-th-ranked neuron as $\mathcal{I}_l^j$, thus we have
$$\mathcal{I}_l(p_l) = \sum_{j=1}^{p_l} \mathcal{I}_l^j, \quad 0 \leq p_l \leq N_l, \quad \mathcal{I}_l^1 \geq \cdots \geq \mathcal{I}_l^{N_l}. \tag{4}$$
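To make Eq. 3 concrete, a minimal PyTorch sketch of the per-neuron score is given below. It assumes scoring on `BatchNorm2d` parameters after a standard backward pass; the function name and the choice to accumulate over a few mini-batches are illustrative rather than prescribed by the paper.

```python
import torch
import torch.nn as nn

def taylor_importance(model: nn.Module) -> dict:
    """Eq. 3: I_l^n = |g_gamma * gamma + g_beta * beta| per BatchNorm channel.

    Call after loss.backward() so that .grad fields are populated.
    Returns {bn_layer_name: 1-D tensor of length N_l}.
    """
    scores = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            scores[name] = (m.weight.grad * m.weight
                            + m.bias.grad * m.bias).abs().detach()
    return scores

# Typical use: average the scores over a few mini-batches before ranking.
# for images, labels in loader:
#     model.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     batch_scores = taylor_importance(model)  # accumulate / average these
```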
Latency contribution. We empirically obtain the layer latency $T_l(p_{l-1}, p_l)$ in Eq. 2 by pre-building a layer-wise look-up table with pre-measured latencies. This layer latency corresponds to the aggregation of the latency contribution $c_l^j$ of each neuron in the layer:
$$T_l(p_{l-1}, p_l) = \sum_{j=1}^{p_l} c_l^j, \quad 0 \leq p_l \leq N_l. \tag{5}$$
The latency contribution of the $j$-th neuron in the $l$-th layer can also be computed using the entries in the look-up table as:
$$c_l^j = T_l(p_{l-1}, j) - T_l(p_{l-1}, j-1), \quad 1 \leq j \leq p_l. \tag{6}$$
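As an illustration of how such a look-up table can be pre-built, the sketch below times a single convolution for every (input, output) channel pair and then differentiates neighboring entries to recover the per-neuron contributions of Eq. 6. Function names, the millisecond unit, and the timing loop are assumptions for illustration; the actual table in HALP is measured per layer configuration on the target hardware, typically at the channel-group granularity of Section 3.3 rather than per single channel.

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def build_latency_lut(c_in_max, c_out_max, k, stride, feat, device="cuda", iters=50):
    """lut[(p_prev, p)] ~ T_l(p_prev, p): measured latency (ms) of a conv layer
    with p_prev input and p output channels on the given feature-map size."""
    lut = {}
    x_full = torch.randn(1, c_in_max, feat, feat, device=device)
    for p_prev in range(1, c_in_max + 1):
        x = x_full[:, :p_prev].contiguous()
        for p in range(1, c_out_max + 1):
            conv = nn.Conv2d(p_prev, p, k, stride=stride, padding=k // 2,
                             bias=False).to(device)
            for _ in range(10):          # warm-up to stabilize clocks/caches
                conv(x)
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(iters):
                conv(x)
            torch.cuda.synchronize()
            lut[(p_prev, p)] = (time.perf_counter() - t0) / iters * 1e3
    return lut

def neuron_latency_contributions(lut, p_prev, n_out):
    """Eq. 6: c_l^j = T_l(p_prev, j) - T_l(p_prev, j-1), with T_l(p_prev, 0) = 0."""
    return [lut[(p_prev, j)] - (lut[(p_prev, j - 1)] if j > 1 else 0.0)
            for j in range(1, n_out + 1)]
```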
Algorithm 1 Augmented Knapsack Solver
Input: Importance scores $\{\mathcal{I}_l \in \mathbb{R}^{N_l}\}_{l=1}^{L}$, each $\mathcal{I}_l$ sorted in descending order; neuron latency contributions $\{c_l \in \mathbb{R}^{N_l}\}_{l=1}^{L}$; latency constraint $C$.
1: maxV ∈ R^(C+1), keep ∈ R^(L×(C+1))    ▷ maxV[c]: max importance under constraint c; keep[l, c]: # neurons to keep in layer l to achieve maxV[c]
2: for l = 1, ..., L do
3:   for j = 1, ..., N_l do
4:     for c = 1, ..., C do
5:       v_keep = I_l^j + maxV[c − c_l^j], v_prune = maxV[c]    ▷ total importance achievable under constraint c with neuron j kept or pruned
6:       if v_keep > v_prune and keep[l, c − c_l^j] == j − 1 then    ▷ check if keeping leads to a higher score and all more important neurons in the layer are kept
7:         keep[l, c] = j, update_maxV[c] = v_keep
8:       else
9:         keep[l, c] = keep[l, c − 1], update_maxV[c] = v_prune
10:      end if
11:    end for
12:    maxV ← update_maxV
13:  end for
14: end for
15:
16: keep_n = ∅    ▷ to save the kept neurons in the model
17: for l = L, ..., 1 do    ▷ retrieve the set of kept neurons
18:   p_l = keep[l, C]
19:   keep_n ← keep_n ∪ {p_l top-ranked neurons in layer l}
20:   C ← C − Σ_{j=1}^{p_l} c_l^j
21: end for
Output: Kept important neurons (keep_n).
In practice, we first rank neurons globally by importance and then consider their latency contribution. Thus, we can draw the following properties. If we remove the least important neuron in layer $l$, then the number of neurons changes from $p_l$ to $p_l - 1$, leading to a latency reduction of $c_l^{p_l}$ as this neuron's latency contribution score. We assign the potential latency reduction to neurons in the layer by the importance order. The most important neuron in that layer always has a latency contribution of $c_l^1$. At this stage, finding the right combination of neurons to keep imposes a combinatorial problem, and in the next section we tackle it via reward maximization considering latency and accuracy traits.
3.2 Augmented Knapsack Solver
Given both importance and latency estimates, we now aim at solving Eq. 2. By plugging back in the layer importance of Eq. 4 and the layer latency of Eq. 5, we come to
$$\max \; \sum_{l=1}^{L} \sum_{j=1}^{p_l} \mathcal{I}_l^j \quad \text{s.t.} \quad \sum_{l=1}^{L} \sum_{j=1}^{p_l} c_l^j \leq C, \quad 0 \leq p_l \leq N_l, \quad \mathcal{I}_l^1 \geq \mathcal{I}_l^2 \geq \cdots \geq \mathcal{I}_l^{N_l}. \tag{7}$$
This simplifies the overall pruning process into a knapsack problem, only with additional precedence constraints. The precedence enforcement originates from the fact that, for a neuron with rank $j$ in the $l$-th layer, the neuron latency contribution only holds when all the neurons with rank $r = 1, \dots, j-1$ are kept in the $l$-th layer and the rest of the neurons with rank $r = j+1, j+2, \dots, N_l$ are removed. Yet the problem is solvable by specifying, for each item, the list of preceding items that need to be selected before its inclusion.

We augment the knapsack solver to consider the reordered neurons with descending importance score so that all the preceding neurons of an item are processed before it. Pseudo code of the augmented knapsack solver is provided in Algo. 1 (a detailed explanation is provided in Appendix A). The augmented solver is required to ensure that the latency cost is accounted for correctly.
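For concreteness, a compact Python sketch of the dynamic program of Algo. 1 is given below. It assumes latency contributions have been quantized to integers (e.g., microseconds) so the budget can index a DP table, and it snapshots the state before each neuron so the precedence check reads the solution over previously processed items; these are implementation choices for illustration, not requirements stated in the paper.

```python
def augmented_knapsack(importance, cost, budget):
    """Maximize total importance under an integer latency budget, keeping the
    j-th ranked neuron of a layer only if all higher-ranked neurons are kept.

    importance[l][j] : score of the j-th ranked neuron of layer l (descending).
    cost[l][j]       : integer latency contribution c_l^j of that neuron.
    Returns keep_counts[l] = number of top-ranked neurons kept in layer l.
    """
    L = len(importance)
    best = [0.0] * (budget + 1)                    # best[c]: max importance within budget c
    keep = [[0] * (budget + 1) for _ in range(L)]  # keep[l][c]: kept neurons of layer l

    for l in range(L):
        for j in range(len(importance[l])):
            w, v = cost[l][j], importance[l][j]
            prev_best = best[:]                    # state before considering neuron j
            prev_keep = keep[l][:]
            for c in range(w, budget + 1):
                # precedence: only extend solutions already keeping the top j neurons of layer l
                if prev_keep[c - w] == j and prev_best[c - w] + v > best[c]:
                    best[c] = prev_best[c - w] + v
                    keep[l][c] = j + 1

    # Trace back layer by layer, releasing each layer's spent latency (Algo. 1, lines 17-21).
    keep_counts, c = [0] * L, budget
    for l in range(L - 1, -1, -1):
        p = keep[l][c]
        keep_counts[l] = p
        c -= sum(cost[l][:p])
    return keep_counts


# Toy usage: two layers, per-neuron importance/cost, budget of 6 latency units.
imp = [[5.0, 3.0, 1.0], [4.0, 2.0]]
lat = [[2, 2, 2], [1, 1]]
print(augmented_knapsack(imp, lat, 6))  # -> [2, 2]: keep the top-2 neurons in each layer
```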
3.3 Neuron Grouping
Considering each neuron individually results in burdensome computation during pruning. We next explore grouping neurons so that a number of them can be jointly considered and removed, enabling faster pruning [69]. Neuron grouping helps exploit the hardware-incurred channel granularity guided by the latency, speeds up the knapsack solving of Eq. 7, and yields structures that maximize GPU utilization, keeping as many parameters as possible under similar latency. In addition, neuron grouping simplifies the knapsack problem, which scales linearly with the number of candidates under consideration (see Line 3 of Alg. 1), thus speeding up the solver as we discuss later.

We refer to the difference in neuron counts between two latency cliffs of the staircase-patterned latency as the latency step size. In our method, we group $s$ channels in a layer as an entirety, where the value of $s$ is equal to the latency step size. The neurons are grouped by the order of importance.
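A minimal sketch of this grouping step is shown below (names are illustrative): consecutive channels in the importance-sorted order are merged into groups of the latency step size $s$, and each group's importance and latency contribution are the sums over its members, so that groups, rather than individual neurons, become the items fed to the solver of Algo. 1.

```python
def group_neurons(importance, latency_contrib, step_size):
    """Merge the importance-sorted neurons of one layer into groups of size
    `step_size` (the latency step size). Each group becomes a single knapsack
    item whose score and cost are the sums over its member neurons."""
    groups = []
    for start in range(0, len(importance), step_size):
        end = min(start + step_size, len(importance))
        groups.append({
            "ranks": list(range(start, end)),           # neuron ranks in this group
            "importance": sum(importance[start:end]),   # grouped score for Eq. 7
            "latency": sum(latency_contrib[start:end]), # grouped cost for Eq. 7
        })
    return groups

# Example: with a latency step size of 32, a 128-channel layer yields 4 groups.
# layer_groups = group_neurons(sorted_scores, contribs, step_size=32)
```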