Weight Fixing Networks
Christopher Subia-Waud & Srinandan Dasmahapatra
University of Southampton, Southampton, SO17 1BJ
{cc2u18, sd}@soton.ac.uk
Abstract. Modern iterations of deep learning models contain millions
(billions) of unique parameters—each represented by a b-bit number.
Popular attempts at compressing neural networks (such as pruning and
quantisation) have shown that many of the parameters are superfluous,
which we can remove (pruning) or express with b′ < b bits (quantisa-
tion) without hindering performance. Here we look to go much further
in minimising the information content of networks. Rather than a chan-
nel or layer-wise encoding, we look to lossless whole-network quantisa-
tion to minimise the entropy and number of unique parameters in a
network. We propose a new method, which we call Weight Fixing Net-
works (WFN) that we design to realise four model outcome objectives:
i) very few unique weights, ii) low-entropy weight encodings, iii) unique
weight values which are amenable to energy-saving versions of hardware
multiplication, and iv) lossless task-performance. Some of these goals are
conflicting. To best balance these conflicts, we combine a few novel (and some
well-trodden) tricks: a novel regularisation term (i, ii), a view of clustering
cost as relative distance change (i, ii, iv), and a focus on whole-network
re-use of weights (i, iii). Our Imagenet experiments demonstrate
lossless compression using 56x fewer unique weights and a 1.9x lower
weight-space entropy than SOTA quantisation approaches. Code and
model saves can be found at github.com/subiawaud/Weight_Fix_Networks.
Keywords: Compression, Quantization, Minimal Description Length, Deep Learning Accelerators
(Published in ECCV 2022, Part XI, LNCS 13671, ISBN 978-3-031-20082-3; preprint arXiv:2210.13554v1 [cs.LG], 24 Oct 2022.)
1 Introduction
Deep learning models have a seemingly inexorable trajectory toward growth.
Growth in applicability, performance, investment, and optimism. Unfortunately,
one area of growth is lamentable - the ever-growing energy and storage costs
required to train and make predictions. To fully realise the promise of deep
learning methods, work is needed to reduce these costs without hindering task
performance.
Here we look to contribute a methodology and refocus towards the goal of
reducing both the number of bits to describe a network as well as the total number
of unique weights in a network. The motivation to do so is driven both by prac-
tical considerations of accelerator designs [1, 2], as well as the more theoretical
persuasions of the Minimal Description Length (MDL) principle [3–5] as a way
of determining a good model.
[Figure 1: two bar charts comparing Baseline, APoT (3-bit) and WFN on ResNet-18 and ResNet-34. Left panel: total unique parameter count (log scale); right panel: weight-space entropy.]
Fig. 1. WFN reduces the total number of unique weights and the entropy of a network far
further than other quantisation works. Left: The total number of unique parameters
left after quantisation is 56x fewer than APoT for ResNet-18 trained on the Imagenet
dataset and 71x for the ResNet-34 model. Right: The entropy of the parameters across
the network is 1.9x and 1.65x smaller when using the WFN approach over APoT.
The Minimal Description Length. Chaitin’s hypothesis captures the think-
ing behind the MDL principle with the statement “comprehension is compres-
sion” [6, 7]. That is, to learn is to compress. In this setting, the best model
minimises the combined cost of describing both the model and the prediction
errors. Deep learning models have shown themselves adept at minimising the
latter, but it is not clear that we sufficiently compress the model description
through unbounded standard gradient descent training. One way to think about
MDL in relation to deep learning compression, originally posed in 'Keeping neural
networks simple by minimizing the description length of the weights' [8], is the following:
Imagine that Bob wants to communicate ground-truth targets to Alice. To
achieve this, he can forward both a predictive model and its errors, compressed
to as few bits as possible without losing any information. Given these two com-
ponents, Alice can pass the input data into the model and, given the commu-
nicated errors from Bob, make any adjustments to its output to retrieve the
desired ground truth. This formulation is the two-part compression approach
to the problem of learning [6]. The MDL principle says that the best model is
obtained by minimising the sum of the lengths of the codes for model and errors.
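In symbols (a standard statement of the two-part code, written here for concreteness rather than quoted from [6]): if H denotes the model and D the targets, the MDL-best model minimises

L(D) = L(H) + L(D | H),

where L(H) is the number of bits needed to describe the model and L(D | H) the number of bits needed to describe the targets (equivalently, the errors) once the model's predictions are known.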
Although the MDL treatment of learning is largely theoretical, it has mo-
tivated the design of compressed network architectures [8–10]. We believe that
a more direct optimisation to minimise the information theoretic content could
bear fruit for downstream hardware translations, but let us start by setting out
the description quantities we wish to minimise.
Describing a Solution. Describing a classification model’s error is well cap-
tured with the cross-entropy loss. From an information-theoretic perspective, the
cross-entropy loss measures the average message length per data point needed to
describe the difference in predicted and target output distributions. Essentially,
large errors cost more to describe than small errors.
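As a small, hedged illustration of this reading (the function and the toy numbers are ours, not the paper's), the snippet below measures the cross-entropy in bits, i.e. the average number of bits needed to encode the targets under the model's predicted distribution:

import numpy as np

def cross_entropy_bits(targets_onehot, predicted_probs, eps=1e-12):
    # Average per-example description length, in bits, of the targets when
    # encoded with a code matched to the model's predicted probabilities.
    return float(-np.mean(np.sum(targets_onehot * np.log2(predicted_probs + eps), axis=1)))

# A confident, correct prediction is cheap to describe; a poor one is expensive.
targets = np.array([[0.0, 1.0], [1.0, 0.0]])
good = np.array([[0.05, 0.95], [0.90, 0.10]])
poor = np.array([[0.60, 0.40], [0.30, 0.70]])
print(cross_entropy_bits(targets, good))  # ~0.11 bits per example
print(cross_entropy_bits(targets, poor))  # ~1.53 bits per example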
The cost of describing a model is more complex, requiring two elements to
be communicated – the weight values, and their arrangement. A metric used
to capture both components is the representation cost, as outlined in Deep K-
means [11] (Equation 5 below). Here, the cost of describing the model is defined
as the summed bit-width of each weight representation times the number of
times each weight is used in an inference calculation.
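A minimal sketch of that cost in code (our paraphrase of the idea, not Equation 5 of [11] verbatim; the names are ours):

def representation_cost(bitwidth_of, use_count_of):
    # bitwidth_of[w]: bits used to represent the unique weight value w
    # use_count_of[w]: how many times w is fetched during one inference pass
    return sum(bitwidth_of[w] * use_count_of[w] for w in bitwidth_of)

# Example: two unique 8-bit weight values, one used 1000 times, the other 10 times.
print(representation_cost({0.25: 8, -0.5: 8}, {0.25: 1000, -0.5: 10}))  # 8080

Driving down either factor, fewer bits per weight or fewer expensive fetches, reduces this cost.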
Minimising the Representation Costs in Accelerators. Importantly, this
representation cost as a model description can be translated directly into acceler-
ator design savings, as shown by the seminal work in Deep Compression [12] and
subsequent accelerator design, EIE [2]. Here the authors cluster neural networks’
weights and use Huffman encoding to represent/describe the network cheaply.
From an information-theoretic perspective, the motivation for Huffman encoding
is simple; this encoding scheme is likely to give us a compressed result closest to
our underlying weight-space entropy. However, this work was not focused on the
information content, so why was it advantageous to an accelerator design? For
this, we need to look at where the computation costs are most concentrated.
The most expensive energy costs in inference calculations lie in memory
reads [13, 14]. For every off-chip DRAM data read, we pay the equivalent of
over two hundred 32-bit multiplications in energy costs (measured for a 45 nm process) [13]. This energy cost
concentration has led to the pursuit of data movement minimisation schemes
using accelerator dataflow mappings [2, 15–17]. These mappings aim to store
as much of the network as possible close to computation and maximise weight
re-use. From an algorithmic perspective, this makes networks with low entropy
content desirable. To make use of a low entropy weight-space, a dataflow map-
ping can store compressed codewords for each weight which, when decoded, point
to the address of the full precision weight, which itself is stored in cheaper access
close-to-compute memory. The idea is that the addition of the codeword storage
and access costs plus the full weight value access costs can be much smaller than
in the unquantised network [11,18]. Several accelerator designs have successfully
implemented such a scheme [1, 2].
As a simple illustrative example, let us define a filter in a network post-quantisation
with the values [900, 104, 211, 104, 104, 104, 399, 211, 104]. This network has an
entropy of 1.65, meaning each weight can be represented, on average, using a minimum
of 1.65 bits, compared to the roughly 9.8 bits (log2(900)) needed per value in an
uncompressed version of the same network. Using Huffman encoding, we get close to
this bound by encoding the network weights as w ↦ c(w) with:
c(104) = 1, c(211) = 01, c(399) = 001, c(900) = 000
and the complete filter can be represented as "000101111001011", totalling just
15 bits, or roughly 1.67 bits on average per value. Each decoded codeword points to a
corresponding weight in the reduced set of unique weights required for computation.
These unique weights (since there are very few of them) can be stored close
to compute on memory units, processing element scratchpads or SRAM cache
depending on the hardware flavour [19, 20], all of which have minimal (almost
free) access costs. The storage and data movement cost of the encoded weights
plus the close-to-compute weight access should be smaller than the storage and
movement costs of directly representing the network with the weight values.
This link – between minimising the model description and reducing accelerator
representational costs – motivates our approach.
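To make the worked example above concrete, the short, self-contained script below (ours, not taken from the paper or its code release) recomputes the entropy and the Huffman code lengths for that filter; the entropy evaluates to about 1.66 bits per weight, matching the figure quoted above up to rounding.

from collections import Counter
import heapq, math

filter_vals = [900, 104, 211, 104, 104, 104, 399, 211, 104]
counts = Counter(filter_vals)
n = len(filter_vals)

# Shannon entropy of the empirical weight distribution (bits per weight).
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

# Build a Huffman code over the unique weight values. Heap entries are
# (total count, tiebreak id, {value: code-so-far}).
heap = [(c, i, {v: ""}) for i, (v, c) in enumerate(counts.items())]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    c1, _, a = heapq.heappop(heap)
    c2, _, b = heapq.heappop(heap)
    merged = {v: "0" + code for v, code in a.items()}
    merged.update({v: "1" + code for v, code in b.items()})
    heapq.heappush(heap, (c1 + c2, next_id, merged))
    next_id += 1
codes = heap[0][2]

encoded = "".join(codes[v] for v in filter_vals)
print(entropy)       # ~1.66 bits per weight
print(codes)         # code lengths: 104 -> 1 bit, 211 -> 2 bits, 399 and 900 -> 3 bits each
print(len(encoded))  # 15 bits in total for the nine weights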
Objectives. So we ask ourselves what we could do algorithmically to maximise
the benefit of accelerator dataflows and minimise the description length. Since
Huffman encoding is used extensively in accelerator designs, we focus on find-
ing networks that reduce the network description when compressed using this
scheme. To do this, we first focus on reducing the number of unique weights a
network uses. Fewer unique weights whilst fixing the network topology and the
total number of parameters will mean that more weights are re-used more often.
Further gains can be achieved if we can concentrate the distribution of weights
around a handful of values, enabling frequently used weights to be stored cheaply,
close to compute. Finally, we ask what the ideal values of these weights would
be. From a computational perspective, not all multiplications are created equal.
Powers-of-two, for example, can be implemented as simple bit-shifts. Mapping
the weights used most to these values offers potential further optimisation in
hardware. Putting these three requirements together: few unique weights; a low-
entropy encoding with a distribution of weights highly concentrated around a
tiny subset of values; and a focus on powers-of-two values for weights — all moti-
vated to both minimise the MDL as well as the computation costs in accelerator
designs — we present our contribution.
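To illustrate the powers-of-two point above (our example, assuming a fixed-point integer activation and a non-negative exponent; not the paper's implementation): multiplying by a weight 2^k needs no multiplier at all, only a shift.

def mul_by_pow2(x_fixed: int, k: int) -> int:
    # Multiply a fixed-point value by the weight 2**k using a left shift only.
    # Negative exponents would instead become right shifts, with the rounding
    # behaviour left as a hardware design choice.
    return x_fixed << k

assert mul_by_pow2(13, 3) == 13 * 8
assert mul_by_pow2(-7, 2) == -7 * 4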
Weight Fixing Networks. Our work’s overarching objective is to transform
a network comprising many weights of any value (limited only by value pre-
cision) to one with the same number of weights but just a few unique values
and concentrate the weights around an even smaller subset of weights. Rather
than selecting the unique weights a priori, we let the optimisation guide the pro-
cess in an iterative cluster-then-train approach. We cluster an ever-increasing
subset of weights to one of a few cluster centroids in each iteration. We map
the pre-trained network weights to these cluster centroids, which constitute a
pool of unique weights. The training stage follows standard gradient descent
optimisation to minimise performance loss, with two key additions. Firstly, only
an ever-decreasing subset of the weights is free to be updated. Secondly, we use a
new regularisation term to penalise weights with large relative distances to their
nearest clusters. We iteratively cluster subsets of weights to their nearest cluster
centre, and the way we determine which subset to move is a core component of
our contribution.
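As a rough sketch of the kind of regulariser just described (our notation and simplification; the exact term used in WFN may differ), the penalty below grows with the relative, rather than absolute, distance of each still-free weight to its nearest cluster centre:

import numpy as np

def relative_distance_penalty(free_weights, cluster_centres, eps=1e-8):
    # Distance from each free weight to its nearest cluster centre, measured
    # relative to the weight's own magnitude, then averaged into one penalty.
    w = np.asarray(free_weights, dtype=np.float64).reshape(-1, 1)     # (N, 1)
    c = np.asarray(cluster_centres, dtype=np.float64).reshape(1, -1)  # (1, K)
    nearest_dist = np.abs(w - c).min(axis=1)                          # (N,)
    return float(np.mean(nearest_dist / (np.abs(w).ravel() + eps)))

# Example with made-up weights and centres: 0.26 is relatively close to the
# centre at 0.25, while 0.9 is comparatively far from its nearest centre at 1.0.
print(relative_distance_penalty([0.26, 0.9], [0.0, 0.25, 0.5, 1.0]))  # ~0.075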
Small Relative Distance Change. Rather than selecting subsets with small
Euclidean distances to cluster centres, or those that have small magnitude [21],
we make the simple observation that the relative – as opposed to absolute –
weight change matters. We find that the tolerated distance δw_i we can move a