
correlation among filters. Learning filter weights is left to CNN training. However, the scores $S_l$, learned by PAAM, are then pointwise ($\odot$) multiplied by the feature maps $A_l$ of the same layer:
$$A'_l = S_l \odot A_l \tag{3}$$
The closer a filter score is to $1$, the more the corresponding feature map is preserved. Note also that using an analog value for the scores while training the AN allows PAAM to compare their relative importance later on. In particular, this is useful when PAAM employs the globally allocated budget, and the cumulative score distribution, to select the filters to be used during CNN training.
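As an illustration of Eq. (3), a minimal PyTorch-style sketch of this modulation is shown below; the tensor names and shapes are assumptions for illustration, not PAAM's actual implementation:

```python
import torch

def modulate_feature_maps(feature_maps: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Pointwise-scale each feature map by its filter score (Eq. 3).

    feature_maps: (batch, num_filters, H, W) activations A_l of layer l.
    scores:       (num_filters,) analog filter scores S_l.
    """
    # Broadcast the per-filter scores over the batch and spatial dimensions,
    # so filters with scores close to 0 contribute little to the next layer.
    return feature_maps * scores.view(1, -1, 1, 1)
```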
2.3 Activation Function.
SoftMax is the typical choice of activation function in additive attention when computing importance (Vaswani et al., 2017b). However, SoftMax is not a suitable choice for filter scores. While the range of its outputs is between $0$ and $1$, the sum of its outputs has to be $1$, meaning that either all scores will have a small value, or there will be only one score with value $1$. In contrast to SoftMax, we would intuitively want many filter scores to be able to stay close to $1$.
More formally, the scores should have the following three main attributes: 1) All filter scores should have a positive value that ranges between $0$ and $1$, as is the case in SoftMax. 2) All filter scores should adapt from their initial value of $1$, as we start with a completely dense model. 3) The filter-score activation function should have non-zero gradients over its entire input domain.
Figure 3: The leaky-exponential activation function.
Sigmoidal activations satisfy Attributes 1 and 3. However, they have difficulties with Attribute 2. For high temperatures, sigmoids behave like steps, and scores quickly converge to $0$ or $1$. The chance that these scores change again is very low, as the gradient is close to zero at this point. Conversely, for low temperatures, scores have a hard time converging to $0$ or $1$. Finding the optimal temperature requires an extensive search for each of the layers separately. Finally, starting from a dense network with all scores set to $1$ is not feasible.
To satisfy Attributes 1-3, we designed our own activation function, as shown in Figure 3. First, in order to ensure that scores are positive, we use an exponential activation function and learn its logarithm value. Second, we allow the activation to be leaky, by not saturating it at $1$, as this would result in $0$ gradients and scores getting stuck at $1$. Formally, our leaky-exponential activation function is defined as follows, where $a$ is a small value (an ablation study is provided in the Appendix):
$$\varphi(x) = \begin{cases} e^{x} & \text{if } x < 0 \\ 1.0 + ax & \text{if } x \geq 0 \end{cases} \tag{4}$$
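A minimal PyTorch sketch of Eq. (4) follows; the default value for the slope $a$ and the use of `torch.where` are our own assumptions for illustration:

```python
import torch

def leaky_exponential(x: torch.Tensor, a: float = 0.01) -> torch.Tensor:
    """Leaky-exponential activation of Eq. (4).

    Returns exp(x) for x < 0 (always positive, smoothly approaching 0)
    and 1.0 + a*x for x >= 0 (leaky, so the gradient never vanishes).
    """
    # Clamp before exp so the unused branch cannot overflow for large x.
    return torch.where(x < 0, torch.exp(torch.clamp(x, max=0.0)), 1.0 + a * x)
```

Both branches equal $1$ at $x = 0$, so the function is continuous there, and this point corresponds to the dense initialization in which all scores start at $1$.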
2.4 Optimization Problem.
The PAAM optimization problem can be formulated as in Equation (5),
where $L$ is the CNN loss function, and $f$ is the CNN function with inputs $x$, labels $y$, and parameters $W$, modulated by the AN, which takes the filter weights as inputs, has parameters $V$, and outputs the scores $S$. Let $\|g(S_l, p)\|_1$ denote the $l_1$-norm of the binarized filter scores of layer $l$, $F_l$ the number of filters of layer $l$, $L$ the number of layers, and $p$ the pruning ratio. Then PAAM should preserve only as many filters as desired by $p$, while at the same time minimizing the loss function.
$$\min_{V} \; L(y, f(x; W, S)) \quad \text{s.t.} \quad \sum_{l=0}^{L} \|g(S_l, p)\|_1 - p \sum_{l=0}^{L} F_l = 0 \tag{5}$$
Function $g(S_l, p)$ is discussed in the next section. The budget constraint is addressed by adding an $l_1$ regularization term to the loss function while training the weights of the AN. This term employs the analog scores of the filters in each layer, which are computed as described above. Formally:
$$L'(S) = L(y, f(x; W, S)) + \lambda \sum_{l=0}^{L} \|S_l\|_1 \tag{6}$$
where $\lambda$ is a global hyper-parameter controlling the regularizer's effect on the total loss. Since the loss function now consists of both the classification accuracy and the $l_1$-norm cost of the scores, reducing filter scores to decrease the $l_1$ cost directly influences accuracy, as the scores multiply the activation maps.
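A hedged sketch of how the regularized loss of Eq. (6) could be computed in a training step is shown below; the cross-entropy task loss and all identifier names are assumptions for illustration, not PAAM's actual code:

```python
import torch
import torch.nn.functional as F

def paam_regularized_loss(logits: torch.Tensor,
                          labels: torch.Tensor,
                          scores_per_layer: list[torch.Tensor],
                          lmbda: float) -> torch.Tensor:
    """Classification loss plus the l1-norm of the analog filter scores (Eq. 6)."""
    # Task loss L(y, f(x; W, S)); cross-entropy is assumed here.
    task_loss = F.cross_entropy(logits, labels)
    # l1 regularizer over the scores of all layers, pushing scores toward 0.
    l1_scores = sum(s.abs().sum() for s in scores_per_layer)
    return task_loss + lmbda * l1_scores
```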
In more detail, by incorporating both the classification term and the $l_1$-norm of the scores in the loss function, the effect of the score of each filter is accounted for in the loss value in two ways: