Weight Fixing Networks
Christopher Subia-Waud & Srinandan Dasmahapatra
University of Southampton, Southampton, SO17 1BJ
{cc2u18, sd}@soton.ac.uk
Abstract. Modern iterations of deep learning models contain millions
(billions) of unique parameters—each represented by a b-bit number.
Popular attempts at compressing neural networks (such as pruning and
quantisation) have shown that many of the parameters are superfluous,
which we can remove (pruning) or express with b′ < b bits (quantisa-
tion) without hindering performance. Here we look to go much further
in minimising the information content of networks. Rather than a chan-
nel or layer-wise encoding, we look to lossless whole-network quantisa-
tion to minimise the entropy and number of unique parameters in a
network. We propose a new method, which we call Weight Fixing Net-
works (WFN) that we design to realise four model outcome objectives:
i) very few unique weights, ii) low-entropy weight encodings, iii) unique
weight values which are amenable to energy-saving versions of hardware
multiplication, and iv) lossless task-performance. Some of these goals are
conflicting. To best balance these conflicts, we combine a few novel (and some
well-trodden) tricks: a novel regularisation term (i, ii), a view of clustering
cost as relative distance change (i, ii, iv), and a focus on whole-network
re-use of weights (i, iii). Our Imagenet experiments demonstrate
lossless compression using 56x fewer unique weights and a 1.9x lower
weight-space entropy than SOTA quantisation approaches. Code and
model saves can be found at github.com/subiawaud/Weight_Fix_Networks.
Keywords: Compression, Quantization, Minimal Description Length, Deep Learning Accelerators
(Published in ECCV 2022, Part XI, LNCS 13671, ISBN 978-3-031-20082-3; preprint arXiv:2210.13554v1 [cs.LG], 24 Oct 2022.)
1 Introduction
Deep learning models have a seemingly inexorable trajectory toward growth.
Growth in applicability, performance, investment, and optimism. Unfortunately,
one area of growth is lamentable - the ever-growing energy and storage costs
required to train and make predictions. To fully realise the promise of deep
learning methods, work is needed to reduce these costs without hindering task
performance.
Here we look to contribute a methodology and refocus towards the goal of
reducing both the number of bits to describe a network as well as the total number
of unique weights in a network. The motivation to do so is driven both by prac-
tical considerations of accelerator designs [1, 2], as well as the more theoretical
persuasions of the Minimal Description Length (MDL) principle [3–5] as a way
of determining a good model.
[Figure 1: two bar charts comparing Baseline, APoT (3-bit) and WFN on ResNet-18 and ResNet-34. Left panel: total unique parameter count (log scale); right panel: weight-space entropy.]
Fig. 1. WFN reduces the total number of unique weights and the entropy of a network far
further than other quantisation works. Left: The total number of unique parameters
left after quantisation is 56x fewer than APoT for ResNet-18 trained on the Imagenet
dataset and 71x for the ResNet-34 model. Right: The entropy of the parameters across
the network is 1.9x and 1.65x smaller when using the WFN approach over APoT.
The Minimal Description Length. Chaitin’s hypothesis captures the think-
ing behind the MDL principle with the statement “comprehension is compres-
sion” [6, 7]. That is, to learn is to compress. In this setting, the best model
minimises the combined cost of describing both the model and the prediction
errors. Deep learning models have shown themselves adept at minimising the
latter, but it is not clear that we sufficiently compress the model description
through unbounded standard gradient descent training. One way to think about
MDL in relation to deep learning compression, originally posed in 'Keeping neural
networks simple by minimizing the description length of the weights' [8], is the following:
Imagine that Bob wants to communicate ground-truth targets to Alice. To
achieve this, he can forward both a predictive model and its errors, compressed
to as few bits as possible without losing any information. Given these two com-
ponents, Alice can pass the input data into the model and, given the commu-
nicated errors from Bob, make any adjustments to its output to retrieve the
desired ground truth. This formulation is the two-part compression approach
to the problem of learning [6]. The MDL principle says that the best model is
obtained by minimising the sum of the lengths of the codes for model and errors.
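In symbols (a standard statement of the two-part code, written here for concreteness rather than quoted from [6]): if H denotes the model and D the targets, the MDL-best model minimises

L(D) = L(H) + L(D | H),

where L(H) is the number of bits needed to describe the model and L(D | H) the number of bits needed to describe the targets (equivalently, the errors) once the model's predictions are known.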
Although the MDL treatment of learning is largely theoretical, it has mo-
tivated the design of compressed network architectures [8–10]. We believe that
a more direct optimisation to minimise the information theoretic content could
bear fruit for downstream hardware translations, but let us start by setting out
the description quantities we wish to minimise.
Describing a Solution. Describing a classification model’s error is well cap-
tured with the cross-entropy loss. From an information-theoretic perspective, the
cross-entropy loss measures the average message length per data point needed to
describe the difference in predicted and target output distributions. Essentially,
large errors cost more to describe than small errors.
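As a small, hedged illustration of this reading (the function and the toy numbers are ours, not the paper's), the snippet below measures the cross-entropy in bits, i.e. the average number of bits needed to encode the targets under the model's predicted distribution:

import numpy as np

def cross_entropy_bits(targets_onehot, predicted_probs, eps=1e-12):
    # Average per-example description length, in bits, of the targets when
    # encoded with a code matched to the model's predicted probabilities.
    return float(-np.mean(np.sum(targets_onehot * np.log2(predicted_probs + eps), axis=1)))

# A confident, correct prediction is cheap to describe; a poor one is expensive.
targets = np.array([[0.0, 1.0], [1.0, 0.0]])
good = np.array([[0.05, 0.95], [0.90, 0.10]])
poor = np.array([[0.60, 0.40], [0.30, 0.70]])
print(cross_entropy_bits(targets, good))  # ~0.11 bits per example
print(cross_entropy_bits(targets, poor))  # ~1.53 bits per example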
The cost of describing a model is more complex, requiring two elements to
be communicated – the weight values, and their arrangement. A metric used
to capture both components is the representation cost, as outlined in Deep K-
means [11] (Equation 5 below). Here, the cost of describing the model is defined
as the summed bit-width of each weight representation times the number of
times each weight is used in an inference calculation.
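A minimal sketch of that cost in code (our paraphrase of the idea, not Equation 5 of [11] verbatim; the names are ours):

def representation_cost(bitwidth_of, use_count_of):
    # bitwidth_of[w]: bits used to represent the unique weight value w
    # use_count_of[w]: how many times w is fetched during one inference pass
    return sum(bitwidth_of[w] * use_count_of[w] for w in bitwidth_of)

# Example: two unique 8-bit weight values, one used 1000 times, the other 10 times.
print(representation_cost({0.25: 8, -0.5: 8}, {0.25: 1000, -0.5: 10}))  # 8080

Driving down either factor, fewer bits per weight or fewer expensive fetches, reduces this cost.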
Minimising the Representation Costs in Accelerators. Importantly, this
representation cost as a model description can be translated directly into acceler-
ator design savings, as shown by the seminal work in Deep Compression [12] and
subsequent accelerator design, EIE [2]. Here the authors cluster neural networks’
weights and use Huffman encoding to represent/describe the network cheaply.
From an information-theoretic perspective, the motivation for Huffman encoding
is simple; this encoding scheme is likely to give us a compressed result closest to
our underlying weight-space entropy. However, this work was not focused on the
information content, so why was it advantageous to an accelerator design? For
this, we need to look at where the computation costs are most concentrated.
The most expensive energy costs in inference calculations lie in memory
reads [13, 14]. For every off-chip DRAM data read, we pay the equivalent of
over two hundred 32-bit multiplications in energy costs (measured for a 45 nm process) [13]. This energy cost
concentration has led to the pursuit of data movement minimisation schemes
using accelerator dataflow mappings [2, 15–17]. These mappings aim to store
as much of the network as possible close to computation and maximise weight
re-use. From an algorithmic perspective, this makes networks with low entropy
content desirable. To make use of a low entropy weight-space, a dataflow map-
ping can store compressed codewords for each weight which, when decoded, point
to the address of the full precision weight, which itself is stored in cheaper access
close-to-compute memory. The idea is that the addition of the codeword storage
and access costs plus the full weight value access costs can be much smaller than
in the unquantised network [11,18]. Several accelerator designs have successfully
implemented such a scheme [1, 2].
As a simple illustrative example, let us define a filter in a network post-quantisation
with the values [900, 104, 211, 104, 104, 104, 399, 211, 104]. This network has an
entropy of 1.65, meaning each weight can be represented, on average, using a minimum
of 1.65 bits, compared to the roughly 9.8 bits (log2(900)) needed per value in an
uncompressed version of the same network. Using Huffman encoding, we get close to
this bound by encoding the network weights as w ↦ c(w) with:
c(104) = 1, c(211) = 01, c(399) = 001, c(900) = 000
and the complete filter can be represented as "000101111001011", totalling just
15 bits, or roughly 1.67 bits on average per value. Each decoded codeword points to a
corresponding weight in the reduced set of unique weights required for computation.
These unique weights (since there are very few of them) can be stored close
to compute on memory units, processing element scratchpads or SRAM cache
depending on the hardware flavour [19, 20], all of which have minimal (almost
free) access costs. The storage and data movement cost of the encoded weights
plus the close-to-compute weight access should be smaller than the storage and
movement costs of directly representing the network with the weight values.
This link – between minimising the model description and reducing accelerator
representational costs – motivates our approach.
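To make the worked example above concrete, the short, self-contained script below (ours, not taken from the paper or its code release) recomputes the entropy and the Huffman code lengths for that filter; the entropy evaluates to about 1.66 bits per weight, matching the figure quoted above up to rounding.

from collections import Counter
import heapq, math

filter_vals = [900, 104, 211, 104, 104, 104, 399, 211, 104]
counts = Counter(filter_vals)
n = len(filter_vals)

# Shannon entropy of the empirical weight distribution (bits per weight).
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

# Build a Huffman code over the unique weight values. Heap entries are
# (total count, tiebreak id, {value: code-so-far}).
heap = [(c, i, {v: ""}) for i, (v, c) in enumerate(counts.items())]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    c1, _, a = heapq.heappop(heap)
    c2, _, b = heapq.heappop(heap)
    merged = {v: "0" + code for v, code in a.items()}
    merged.update({v: "1" + code for v, code in b.items()})
    heapq.heappush(heap, (c1 + c2, next_id, merged))
    next_id += 1
codes = heap[0][2]

encoded = "".join(codes[v] for v in filter_vals)
print(entropy)       # ~1.66 bits per weight
print(codes)         # code lengths: 104 -> 1 bit, 211 -> 2 bits, 399 and 900 -> 3 bits each
print(len(encoded))  # 15 bits in total for the nine weights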
Objectives. So we ask ourselves what we could do algorithmically to maximise
the benefit of accelerator dataflows and minimise the description length. Since
Huffman encoding is used extensively in accelerator designs, we focus on find-
ing networks that reduce the network description when compressed using this
scheme. To do this, we first focus on reducing the number of unique weights a
network uses. Fewer unique weights whilst fixing the network topology and the
total number of parameters will mean that more weights are re-used more often.
Further gains can be achieved if we can concentrate the distribution of weights
around a handful of values, enabling frequently used weights to be stored cheaply,
close to compute. Finally, we ask what the ideal values of these weights would
be. From a computational perspective, not all multiplications are created equal.
Powers-of-two, for example, can be implemented as simple bit-shifts. Mapping
the weights used most to these values offers potential further optimisation in
hardware. Putting these three requirements together: few unique weights; a low-
entropy encoding with a distribution of weights highly concentrated around a
tiny subset of values; and a focus on powers-of-two values for weights — all moti-
vated to both minimise the MDL as well as the computation costs in accelerator
designs — we present our contribution.
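To illustrate the powers-of-two point above (our example, assuming a fixed-point integer activation and a non-negative exponent; not the paper's implementation): multiplying by a weight 2^k needs no multiplier at all, only a shift.

def mul_by_pow2(x_fixed: int, k: int) -> int:
    # Multiply a fixed-point value by the weight 2**k using a left shift only.
    # Negative exponents would instead become right shifts, with the rounding
    # behaviour left as a hardware design choice.
    return x_fixed << k

assert mul_by_pow2(13, 3) == 13 * 8
assert mul_by_pow2(-7, 2) == -7 * 4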
Weight Fixing Networks. Our work’s overarching objective is to transform
a network comprising many weights of any value (limited only by value pre-
cision) to one with the same number of weights but just a few unique values
and concentrate the weights around an even smaller subset of weights. Rather
than selecting the unique weights a priori, we let the optimisation guide the pro-
cess in an iterative cluster-then-train approach. We cluster an ever-increasing
subset of weights to one of a few cluster centroids in each iteration. We map
the pre-trained network weights to these cluster centroids, which constitute a
pool of unique weights. The training stage follows standard gradient descent
optimisation to minimise performance loss, with two key additions. Firstly, only
an ever-decreasing subset of the weights is free to be updated. Secondly, we use a
new regularisation term to penalise weights with large relative distances to their
nearest clusters. We iteratively cluster subsets of weights to their nearest cluster
centre, and the way we determine which subset to move is a core component of
our contribution.
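As a rough sketch of the kind of regulariser just described (our notation and simplification; the exact term used in WFN may differ), the penalty below grows with the relative, rather than absolute, distance of each still-free weight to its nearest cluster centre:

import numpy as np

def relative_distance_penalty(free_weights, cluster_centres, eps=1e-8):
    # Distance from each free weight to its nearest cluster centre, measured
    # relative to the weight's own magnitude, then averaged into one penalty.
    w = np.asarray(free_weights, dtype=np.float64).reshape(-1, 1)     # (N, 1)
    c = np.asarray(cluster_centres, dtype=np.float64).reshape(1, -1)  # (1, K)
    nearest_dist = np.abs(w - c).min(axis=1)                          # (N,)
    return float(np.mean(nearest_dist / (np.abs(w).ravel() + eps)))

# Example with made-up weights and centres: 0.26 is relatively close to the
# centre at 0.25, while 0.9 is comparatively far from its nearest centre at 1.0.
print(relative_distance_penalty([0.26, 0.9], [0.0, 0.25, 0.5, 1.0]))  # ~0.075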
Small Relative Distance Change. Rather than selecting subsets with small
Euclidean distances to cluster centres, or those that have small magnitude [21],
we make the simple observation that the relative – as opposed to absolute –
weight change matters. We find that the tolerated distance δw_i we can move a