neuron group to reflect and encourage latency reduction. To this end, we first rank the neurons according to their importance estimates, and then dynamically adjust their latency contributions. With neurons re-calibrated towards the hardware-aware latency curve, we then select the remaining neurons to maximize the gradient-based importance estimate for accuracy within the total latency constraint. This makes the entire neuron ranking solvable under the knapsack paradigm. To enforce that neurons within a layer are selected from the most important to the least, we augment the knapsack solver so that the calculated latency contributions of the remaining neurons remain valid. HALP surpasses prior art in pruning efficacy; see Fig. 2 and the more detailed analysis in Section 4. Our main contributions are summarized as follows:
• We propose a latency-driven structured pruning algorithm that exploits hardware latency traits to yield direct inference speedups.
• We orient the pruning process around a quick yet highly effective knapsack scheme that seeks a combination of remaining neuron groups maximizing importance while staying within the target latency (a minimal sketch is given after this list).
• We introduce a group size adjustment scheme for the knapsack solver amid varying latency contributions across layers, hence allowing full exploitation of the latency landscape of the underlying hardware.
• We compare to prior art when pruning ResNet, MobileNet, and VGG architectures on ImageNet, CIFAR10, and PASCAL VOC, and demonstrate that our method yields consistent latency and accuracy improvements over state-of-the-art methods. Our ImageNet pruning results present a viable 1.6× to 1.9× speedup while preserving accuracy very close to that of the original ResNets.
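To make the selection step concrete, below is a minimal sketch (in Python) of the latency-constrained selection viewed as a 0/1 knapsack: each neuron group carries an importance score and a measured latency cost, and we maximize total importance under a latency budget. The function and parameter names are illustrative, and the sketch omits the per-layer ordering constraint and the group size adjustment introduced above.

def knapsack_select(importance, latency_cost, latency_budget, resolution=0.1):
    """Simplified 0/1 knapsack over neuron groups: maximize summed importance
    subject to a total latency budget. importance and latency_cost are
    per-group lists; latencies are discretized to `resolution` (e.g., 0.1 ms)
    so the dynamic-programming table stays tractable."""
    costs = [max(1, round(c / resolution)) for c in latency_cost]
    budget = int(latency_budget / resolution)
    n = len(importance)

    # dp[b] = best achievable importance with total (discretized) latency <= b
    dp = [0.0] * (budget + 1)
    taken = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for b in range(budget, costs[i] - 1, -1):  # reverse loop: each group used at most once
            cand = dp[b - costs[i]] + importance[i]
            if cand > dp[b]:
                dp[b] = cand
                taken[i][b] = True

    # Backtrack from the full budget to recover which groups are kept.
    kept, b = [], budget
    for i in range(n - 1, -1, -1):
        if taken[i][b]:
            kept.append(i)
            b -= costs[i]
    return sorted(kept)

In the full method, the solver is additionally constrained so that neurons within a layer are retained from the most important to the least, which keeps the pre-computed latency contributions of the remaining neurons valid.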
2 Related work
Pruning methods. Depending on when pruning is performed, current methods can generally be divided into three groups [14]: i) prune pretrained models [20, 40, 33, 22, 46, 45, 16], ii) prune at initialization [15, 30, 10], and iii) prune during training [2, 17, 41]. Despite notable progress in the latter two streams, pruning pretrained models remains the most popular paradigm, with structural sparsity favored by off-the-shelf inference platforms such as GPUs.
To improve inference efficiency, many early pruning methods trim down the neural network aiming to achieve a high channel pruning ratio while maintaining an acceptable accuracy. The estimation of neuron importance has been widely studied in the literature [25, 40, 46]. For example, [45] proposes to use a Taylor expansion to measure the importance of neurons and prunes a desired number of the least-ranked neurons. However, a channel pruning ratio does not directly translate into a computation reduction ratio, since neurons at different locations in the network incur different amounts of computation.
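As an illustration, a first-order Taylor importance score in the spirit of [45] can be computed per output channel from the weights and their gradients. The PyTorch-style sketch below is illustrative (the helper name is hypothetical) and assumes a backward pass has already populated the gradients.

import torch
import torch.nn as nn

def taylor_channel_importance(conv: nn.Conv2d) -> torch.Tensor:
    # Per-output-channel score in the spirit of [45]:
    #   score_c = (sum over the filter's weights of weight * gradient)^2
    # conv.weight has shape (out_channels, in_channels, kH, kW).
    w, g = conv.weight, conv.weight.grad
    return (w * g).sum(dim=(1, 2, 3)).pow(2).detach()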
More recent methods focus primarily on reducing FLOPs. Some of them take FLOPs into consideration when calculating neuron importance, so as to penalize neurons that induce heavy computation [66]. An alternative line of work proposes to select the best pruned network from a set of candidates [32, 68]. However, candidate selection is time-consuming due to the large number of candidates. In addition, these methods use FLOPs as a proxy for latency, which is usually inaccurate since networks with similar FLOPs can have significantly different latencies [58].
Latency-aware compression. Emerging compression techniques shift attention to cutting down latency directly. One popular stream is Neural Architecture Search (NAS) methods [9, 12, 58, 65] that adaptively adjust the architecture of the network for a given latency requirement. They incorporate the platform constraints into the optimization process in both the architecture and parameter spaces to jointly optimize the model size and accuracy. Despite remarkable insights, NAS methods remain computationally expensive in general compared to their pruning counterparts.
Latency-oriented pruning has also gained a growing amount of attention. [6] presents a framework for network compression under operational constraints, using Bayesian optimization to iteratively obtain compression hyperparameters that satisfy the constraints. Along the same line, NetAdapt [68] iteratively prunes neurons across layers under the guidance of empirical latency measurements on the target platform. While these methods push the frontier of latency-constrained pruning, the hardware-incurred latency surface in fact offers much more potential under our enhanced pruning policy: as we show later, large room for improvement remains unexploited and realizable.
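For reference, such empirical measurements can be obtained by timing layers directly on the target device for varying channel counts and caching the results in a lookup table. The sketch below is illustrative (helper name and parameters are ours) and assumes a CUDA-capable GPU.

import time
import torch
import torch.nn as nn

@torch.no_grad()
def measure_conv_latency(in_ch, out_ch, hw=56, k=3, reps=50, device="cuda"):
    # Rough empirical latency (ms) of a single convolution for a given channel
    # configuration, measured on the target device.
    conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2).to(device)
    x = torch.randn(1, in_ch, hw, hw, device=device)
    for _ in range(10):               # warm-up iterations
        conv(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(reps):
        conv(x)
    torch.cuda.synchronize()
    return (time.time() - start) / reps * 1e3

# Example: sweep output channels to build a per-layer latency lookup table.
# table = {c: measure_conv_latency(64, c) for c in range(8, 257, 8)}

Measured this way, latency typically changes in steps rather than linearly with the channel count, which is precisely the latency landscape our method sets out to exploit.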