IMPROVED PROJECTION LEARNING FOR LOWER DIMENSIONAL FEATURE MAPS
Ilan Price
Mathematical Institute
University of Oxford
& The Alan Turing Institute
Jared Tanner
Mathematical Institute
University of Oxford
ABSTRACT
The requirement to repeatedly move large feature maps off-
and on-chip during inference with convolutional neural net-
works (CNNs) imposes high costs in terms of both energy
and time. In this work we explore an improved method for
compressing all feature maps of pre-trained CNNs to below a
specified limit. This is done by means of learned projections
trained via end-to-end finetuning, which can then be folded
and fused into the pre-trained network. We also introduce a
new ‘ceiling compression’ framework in which we evaluate such
techniques in view of the future goal of performing inference
fully on-chip.
Index Terms—efficient deep learning, convolutional
neural networks, feature map compression
1. INTRODUCTION
Modern neural network architectures can achieve high accu-
racy while possessing far fewer trainable parameters than had
traditionally been expected. Prototypical examples include
compact weights and shared parameters in convolutional and
recurrent neural networks respectively, as well as sparsifying
and quantizing the weights within such networks, see [1, 2]
and references therein. The efficiency of storing and transmitting these networks stands in stark contrast to their efficiency at inference time, which is determined not only by the size of the model itself (weights, biases, etc.), but, increasingly, by the intermediate feature maps (representations) generated as the outputs of successive layers and inputs to the following layers. For example, with ImageNet-resolution (224x224) inputs, and even without sparsifying the networks, the single largest feature map is 7% of the model size in ResNet18, 2.3% in VGG16, and 34% in MobileNetV2.
This disparity between model and feature map sizes can be dramatically exacerbated, as these networks can have their number of parameters reduced to only a tiny fraction of their original size without loss in classification accuracy [1]. This motivates a line of research to improve efficiency by compressing feature maps as well; see the related work in Section 2.
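To make this ratio concrete, it can be estimated directly with forward hooks. The sketch below is a rough illustration only (the function name, the restriction to leaf modules, and the use of a torchvision ResNet18 are assumptions for the example, not the paper's measurement protocol):

```python
import torch
import torchvision.models as models

def largest_feature_map_ratio(model, input_size=(1, 3, 224, 224)):
    """Size of the largest intermediate feature map relative to the
    total parameter count (both measured in number of elements)."""
    sizes = []

    def hook(module, inputs, output):
        if torch.is_tensor(output):
            sizes.append(output.numel())

    # Hook only leaf modules so each intermediate output is counted once.
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if len(list(m.children())) == 0]
    with torch.no_grad():
        model(torch.randn(input_size))
    for h in handles:
        h.remove()

    n_params = sum(p.numel() for p in model.parameters())
    return max(sizes) / n_params

# Example; exact ratios depend on which activations are counted.
print(largest_feature_map_ratio(models.resnet18().eval()))
```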
Typically, however, not all feature maps need to be stored
simultaneously: once a feature map has been used as the in-
put for the following layer(s), it can be immediately deleted.
Herein we propose an improved method for learning low-rank
projections which can be incorporated into pre-trained CNNs
to reduce their maximal memory requirements. In doing so, this approach seeks both to reduce the memory requirements on a device and, ideally, to eliminate off-chip memory access mid-forward-pass, which can dominate power usage [3, 4], a goal which would enable lower-power deep networks deployed on edge devices.
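As a rough sketch of the kind of learned low-rank projection described above (this is an illustration, not the paper's exact construction; the module name, the choice of 1x1 convolutions, and the rank hyperparameter are assumptions), a feature map can be compressed to a small number of channels before being stored and expanded again before it feeds the next layer:

```python
import torch.nn as nn

class LowRankProjection(nn.Module):
    """Illustrative low-rank channel projection: compress a feature map
    to `rank` channels before it is written to memory, and expand it
    back to `channels` before the next layer consumes it. Both 1x1
    convolutions are linear, so after finetuning they can in principle
    be folded and fused into the neighbouring layers."""
    def __init__(self, channels, rank):
        super().__init__()
        self.down = nn.Conv2d(channels, rank, kernel_size=1, bias=False)
        self.up = nn.Conv2d(rank, channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.up(self.down(x))
```

Under such an arrangement, only the rank-channel tensor would need to be held between layers, which is what lowers the peak feature-map memory requirement.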
2. RELATED WORK
One strand of research on feature map compression makes use
of (and often tries to increase) the sparsity of post-activation
feature maps, which endows them with natural compressibility. Works
such as [5], [6], and [7, 8], develop accelerators which use
Zero Run-length encoding, compressed column storage, and
zero-value compression, respectively, to leverage the natu-
rally occurring feature map sparsity to shrink memory access.
In [9, 10], the authors leverage both ‘sparsity’ and ‘smooth-
ness’ of feature maps, by decomposing the streamed input
into zero and non-zero streams, and applying run-length encoding to compress the lengthy runs of zeros. They then
compress the non-zero values with bit-plane compression.
Similarly, [11] induces sparse feature maps by adding L1
regularisation when finetuning pre-trained networks. Further-
more, they use linear (uniform) quantisation of the feature
maps, as well as entropy coding for the resulting sparse fea-
ture maps. Lastly, [12] proposes an alternative method for
inducing sparse activations, based on finetuning with Hoyer-
sparsity regularisation and a new discontinuous Forced Acti-
vation Threshold ReLU (FATReLU) defined for some thresh-
old T as 0 if x < T and x otherwise. Given that the sparsity
of the feature maps of the aforementioned methods is in-
put dependent, the compressibility and resulting compute
resources needed for inference with this class of methods are
not guaranteed or known ahead of time.
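As a concrete illustration of these sparsity-based schemes, the sketch below implements the FATReLU activation as defined above, together with a toy zero run-length encoding (the function names and the (run, value) pair format are assumptions for the example, not the cited accelerators' hardware formats):

```python
import torch

def fat_relu(x, threshold):
    """FATReLU as defined above: activations below the threshold T are
    forced to zero, larger values pass through unchanged."""
    return torch.where(x < threshold, torch.zeros_like(x), x)

def zero_run_length_encode(x):
    """Toy zero run-length encoding of a flattened feature map: store a
    (preceding_zero_run, value) pair for each non-zero entry."""
    pairs, run = [], 0
    for v in x.flatten().tolist():
        if v == 0.0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs

sparse = fat_relu(torch.randn(4, 4), threshold=0.5)
print(zero_run_length_encode(sparse))
```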
An alternative line of work focuses on transform-based
compression. In [13], the authors propose applying a 1D Discrete Cosine Transform (DCT) along the channel dimension of all feature maps, which are then masked and zero-value coded.
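A minimal sketch of such a channel-wise 1D DCT is given below; the low-pass mask used here is a simplification chosen for the example, and the actual masking and zero-value coding in [13] differ:

```python
import numpy as np
from scipy.fft import dct, idct

def channel_dct_compress(fmap, k):
    """Apply a 1D DCT along the channel axis of a (channels, H, W)
    feature map, zero all but the k lowest-frequency coefficients,
    and reconstruct the (lossy) feature map."""
    coeffs = dct(fmap, axis=0, norm="ortho")
    coeffs[k:] = 0.0  # crude mask: keep only low-frequency coefficients
    return idct(coeffs, axis=0, norm="ortho")

fmap = np.random.randn(64, 56, 56).astype(np.float32)
approx = channel_dct_compress(fmap, k=16)
```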