IMPROVED PROJECTION LEARNING FOR LOWER DIMENSIONAL FEATURE MAPS
Ilan Price
Mathematical Institute
University of Oxford
& The Alan Turing Institute
Jared Tanner
Mathematical Institute
University of Oxford
ABSTRACT
The requirement to repeatedly move large feature maps off- and on-chip during inference with convolutional neural networks (CNNs) imposes high costs in terms of both energy and time. In this work we explore an improved method for compressing all feature maps of pre-trained CNNs to below a specified limit. This is done by means of learned projections trained via end-to-end finetuning, which can then be folded and fused into the pre-trained network. We also introduce a new ‘ceiling compression’ framework in which to evaluate such techniques in view of the future goal of performing inference fully on-chip.
Index Terms: efficient deep learning, convolutional neural networks, feature map compression
1. INTRODUCTION
Modern neural network architectures can achieve high accuracy while possessing far fewer trainable parameters than had traditionally been expected. Prototypical examples include compact weights and shared parameters in convolutional and recurrent neural networks respectively, as well as sparsifying and quantizing the weights within such networks; see [1, 2] and references therein. The efficiency with which these networks can be stored and transmitted stands in stark contrast to their efficiency at inference time, which is determined not only by the size of the model itself (weights, biases, etc.), but increasingly by the intermediate feature maps (representations) generated as the outputs of successive layers and inputs to the following layers. For example, with ImageNet-resolution (224x224) inputs, and even without sparsifying the networks, the single largest feature map is 7% of the model size in ResNet-18, 2.3% in VGG-16, and 34% in MobileNetV2. This imbalance between feature map and model size is exacerbated further because these networks can have their number of parameters reduced to only a tiny fraction of the original without loss in classification accuracy [1]. This motivates a line of research to improve efficiency by compressing feature maps too; see the related work in Section 2.
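These ratios are straightforward to reproduce. The sketch below is our own illustration (not part of the paper), assuming torchvision supplies the pretrained architecture; it records the size of every intermediate feature map of a ResNet-18 on a 224x224 input via forward hooks and compares the largest one with the parameter count.

```python
# Illustrative sketch: compare the largest intermediate feature map of
# ResNet-18 at 224x224 input resolution with its total parameter count.
import torch
import torchvision.models as models

model = models.resnet18().eval()
fmap_sizes = []

def record(module, inputs, output):
    # Store the number of elements in each leaf module's output tensor.
    if isinstance(output, torch.Tensor):
        fmap_sizes.append(output.numel())

hooks = [m.register_forward_hook(record)
         for m in model.modules() if len(list(m.children())) == 0]
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
for h in hooks:
    h.remove()

n_params = sum(p.numel() for p in model.parameters())
print(f"largest feature map / parameter count = {max(fmap_sizes) / n_params:.1%}")
```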
Typically, however, not all feature maps need to be stored simultaneously: once a feature map has been used as the input for the following layer(s), it can be immediately deleted. Herein we propose an improved method for learning low-rank projections which can be incorporated into pre-trained CNNs to reduce their maximal memory requirements. In so doing, this approach seeks both to reduce the memory requirements on a device and, ideally, to eliminate off-chip memory access mid-forward-pass, which can dominate power usage [3, 4], a goal which would enable low-power deployment of deep networks on edge devices.
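To make the idea concrete, a minimal sketch follows. It is our own illustration, not the paper's exact construction: the names ProjectedConv and fold_down_into_conv, and the choice to realise the projection as a pair of 1x1 convolutions, are assumptions for exposition. A pretrained convolution is wrapped with a learnable down-projection to r channels and a matching expansion; when the projection acts directly on the convolution's output, it can be folded into that convolution's weights so that only the r-channel feature map is ever materialised.

```python
# Minimal sketch (our own illustration, not the paper's exact construction):
# wrap a pretrained convolution with a learnable down-projection to r channels
# and a matching expansion, so only the r-channel feature map must be stored.
import torch
import torch.nn as nn

class ProjectedConv(nn.Module):
    def __init__(self, conv: nn.Conv2d, rank: int):
        super().__init__()
        self.conv = conv  # pretrained layer, typically kept frozen
        c_out = conv.out_channels
        self.down = nn.Conv2d(c_out, rank, kernel_size=1, bias=False)
        self.up = nn.Conv2d(rank, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        z = self.down(self.conv(x))   # low-dimensional feature map
        return self.up(z)             # expanded back to the original width

def fold_down_into_conv(conv: nn.Conv2d, down: nn.Conv2d) -> nn.Conv2d:
    # Valid only when no nonlinearity sits between `conv` and `down`: the 1x1
    # projection is then a matrix product on conv's output channels.
    d = down.weight.squeeze(-1).squeeze(-1)             # (rank, c_out)
    w = torch.einsum('rc,cikl->rikl', d, conv.weight)   # (rank, c_in, k, k)
    fused = nn.Conv2d(conv.in_channels, down.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      bias=conv.bias is not None)
    with torch.no_grad():
        fused.weight.copy_(w)
        if conv.bias is not None:
            fused.bias.copy_(d @ conv.bias)
    return fused
```

After the projections are finetuned end to end, folding the down-projection into its preceding convolution (and, analogously, the expansion into the following layer) removes the extra 1x1 convolutions at inference time.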
2. RELATED WORK
One strand of research on feature map compression makes use of (and often tries to increase) the sparsity of post-activation feature maps, which endows them with a natural compressibility. Works such as [5], [6], and [7, 8] develop accelerators which use zero run-length encoding, compressed column storage, and zero-value compression, respectively, to leverage the naturally occurring feature map sparsity and thereby shrink memory access. In [9, 10], the authors leverage both ‘sparsity’ and ‘smoothness’ of feature maps by decomposing the streamed input into zero and non-zero streams, applying run-length encoding to compress the lengthy runs of zeros, and then compressing the non-zero values with bit-plane compression. Similarly, [11] induces sparse feature maps by adding L1 regularisation when finetuning pre-trained networks; they additionally use linear (uniform) quantisation of the feature maps, as well as entropy coding for the resulting sparse feature maps. Lastly, [12] proposes an alternative method for inducing sparse activations, based on finetuning with Hoyer-sparsity regularisation and a new discontinuous Forced Activation Threshold ReLU (FATReLU), defined for some threshold T as 0 if x < T and x otherwise. Since the sparsity of the feature maps produced by the aforementioned methods is input dependent, the compressibility, and hence the compute resources needed for inference with this class of methods, is not guaranteed or known ahead of time.
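As a point of reference for this family of sparsity-based schemes, the sketch below (our own illustration, not taken from any of the cited accelerators) applies zero run-length encoding to a post-ReLU feature map; because the zero pattern varies with the input, so does the achieved compression ratio.

```python
# Illustrative sketch: zero run-length encoding of a flattened post-ReLU
# feature map. The compression ratio depends entirely on the input's zeros.
import numpy as np

def zero_rle(x):
    """Encode a 1D array as (preceding_zero_run, nonzero_value) pairs."""
    pairs, run = [], 0
    for v in x:
        if v == 0:
            run += 1
        else:
            pairs.append((run, float(v)))
            run = 0
    if run > 0:
        pairs.append((run, None))  # trailing run of zeros
    return pairs

fmap = np.maximum(np.random.randn(8, 4, 4), 0)  # mock post-ReLU activations
pairs = zero_rle(fmap.ravel())
print(f"dense: {fmap.size} values, encoded: {len(pairs)} (run, value) pairs")
```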
An alternative line of work focuses on transform-based compression. In [13], the authors propose applying a 1D Discrete Cosine Transform (DCT) on the channel dimension of all feature maps, which are then masked and zero-value coded.
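A rough sketch of this transform idea follows (our illustration, not the accelerator design of [13]): a 1D DCT along the channel axis concentrates the energy of smooth channel profiles into a few coefficients, after which small coefficients can be masked to zero and the result sparsely coded.

```python
# Rough sketch of transform-based feature map compression: 1D DCT over the
# channel dimension, then masking of small coefficients before sparse coding.
import numpy as np
from scipy.fft import dct, idct

fmap = np.random.randn(64, 14, 14)                # (channels, H, W)

coeffs = dct(fmap, axis=0, norm='ortho')          # 1D DCT per spatial location
mask = np.abs(coeffs) >= 0.1 * np.abs(coeffs).max()
sparse = coeffs * mask                            # small coefficients zeroed

recon = idct(sparse, axis=0, norm='ortho')        # approximate reconstruction
err = np.linalg.norm(recon - fmap) / np.linalg.norm(fmap)
print(f"kept {mask.mean():.1%} of coefficients, relative error {err:.3f}")
```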