Block Format Error Bounds
and Optimal Block Size Selection
Ilya Soloveychik, Ilya Lyubomirsky, Xin Wang and Sudeep Bhoja
d-Matrix
Abstract—The amounts of data that modern deep neural networks need to transmit, process, and store have reached truly enormous volumes in the last few years, calling for the invention of new paradigms in both hardware and software development. One of the most promising and rapidly advancing
frontiers here is the creation of new numerical formats. In this
work we focus on the family of block floating point numerical
formats due to their combination of wide dynamic range,
numerical accuracy, and efficient hardware implementation of
inner products using simple integer arithmetic. These formats
are characterized by a block of mantissas with a shared scale
factor. The basic Block Floating Point (BFP) format quantizes the block scales to the nearest powers of two from above, i.e., rounds them up. Its simple modification, Scaled BFP (SBFP), stores the same scales in full precision and thus allows higher accuracy. In this
paper, we study the statistical behavior of both these formats
rigorously. We develop asymptotic bounds on the inner product
error in SBFP- and BFP-quantized normally distributed vectors.
Next, we refine those asymptotic results to finite dimensional
settings and derive high-dimensional tight bounds for the same
errors. Based on the obtained results, we introduce a performance measure assessing the accuracy of any block format. This measure allows us to determine the optimal parameters, such as the block size, yielding the highest accuracy. In particular, we show that if the precision of the BFP format is fixed at 4 bits, the optimal block size becomes 64. All theoretical derivations are supported by
numerical experiments and studies on the weights of publicly
available pretrained neural networks.
Index Terms—Deep learning, in-memory compute, block float-
ing point, quantization.
I. INTRODUCTION
In recent years, online services addressing the requests of billions of customers worldwide have become ubiquitous. Such customization is usually provided by Deep Neural Networks (DNNs) deployed at data centers and equipped with unprecedented computational capabilities. The recently discovered attention paradigm [1] has led to an explosion in the variety and complexity of modern high-performance DNNs based on the transformer architecture [1–3]. Transformers
have achieved tremendous success in all Natural Language
Processing (NLP) tasks due to their ability to learn universal
language representations from large volumes of unlabeled text
data and then transfer this knowledge to downstream tasks
[4, 5]. Importantly, the most common operation in transformers, consuming the majority of the compute and power resources, is matrix multiplication [6, 7]. Creating infrastructure that makes the multiplication of the involved large matrices faster and cheaper without impacting the accuracy of the product has become the central focus of leading researchers and companies working in this field [8–10].
Enabling faster inference and more reliable training of large DNNs is usually limited by the arithmetic bandwidth of the underlying hardware [6, 7, 9, 10]. Quite often, designers and operators of transformer-based networks exploit Graphics Processing Units (GPUs) as the workhorse because they offer higher arithmetic density per silicon area than Central Processing Units (CPUs) [6, 7]. Inference and training tasks in all these cases are usually processed through full-precision floating-point units. However, the computational load required by modern transformers has reached such enormous volumes that traditional GPUs cannot fully meet the growing demand, pushing both accelerators and high-performance GPUs towards narrow arithmetic. As a consequence, the engineering community has applied unmatched research efforts to replacing narrow floating-point with even denser fixed-point representations [11–14]. Despite the excellent gains in both speed and computational density achieved by fixed-point arithmetic, training with it, or even with half-precision floating-point arithmetic, has not provided clear evidence in its favor due to the limited dynamic range inherent in such formats [15].
Block Floating Point (BFP) numerical formats have recently received renewed interest for DNN inference applications due to their combination of wide dynamic range, numerical accuracy, and efficient hardware implementation of inner products using simple integer arithmetic [16, 17]. BFP formats are
characterized by a block of mantissas with a shared scale
factor. The simplest implementation has the scale factor as a
power of two, the so-called exponent, in which case the inner
product between two blocks involves multiplying the integer
mantissas and adding the two block exponents.
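To make this arithmetic concrete, the following is a minimal NumPy sketch of BFP block quantization and of a block inner product. The mantissa width, rounding, and saturation conventions shown here are illustrative assumptions and not necessarily the exact scheme analyzed later in the paper.

import numpy as np

def bfp_quantize(block, mantissa_bits=4):
    # Quantize a 1-D block into signed integer mantissas sharing one
    # power-of-two scale (the block exponent). Conventions are illustrative.
    max_abs = np.max(np.abs(block))
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Shared exponent chosen so that the largest magnitude roughly fills
    # the signed mantissa range [-(2^(m-1) - 1), 2^(m-1) - 1].
    exp = int(np.ceil(np.log2(max_abs))) - (mantissa_bits - 1)
    q_max = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / 2.0 ** exp), -q_max, q_max).astype(np.int32)
    return mantissas, exp

def bfp_inner_product(m1, e1, m2, e2):
    # Integer multiply-accumulate of the mantissas, then a single shift
    # by the sum of the two block exponents.
    return int(np.dot(m1, m2)) * 2.0 ** (e1 + e2)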
Another useful version of the BFP format involves using a floating-point scale factor. The latter provides more numerical accuracy at the expense of slightly increased complexity of the inner product operation, since a floating-point multiplication of the scale factors is required [18]. In this paper, we consider both formats, calling the former simply BFP and the latter SBFP, where the S stands for the floating-point scale factor. Also, for simplicity and without much impact on the results, in this study we treat the block scales of SBFP as infinite-precision numbers.
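A corresponding sketch of the SBFP variant, under the same illustrative assumptions as above, keeps identical integer mantissas but stores the shared scale as a full-precision float, so a block inner product requires one floating-point multiplication of the two scales.

import numpy as np

def sbfp_quantize(block, mantissa_bits=4):
    # Same integer mantissas as in the BFP sketch, but the shared scale is
    # kept in full precision instead of being rounded to a power of two.
    max_abs = np.max(np.abs(block))
    q_max = 2 ** (mantissa_bits - 1) - 1
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 1.0
    scale = max_abs / q_max  # full-precision (here float64) scale factor
    mantissas = np.round(block / scale).astype(np.int32)
    return mantissas, scale

def sbfp_inner_product(m1, s1, m2, s2):
    # Integer multiply-accumulate plus a single floating-point
    # multiplication of the two block scales.
    return int(np.dot(m1, m2)) * (s1 * s2)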
The contribution of this article is three-fold. First, we
develop asymptotic bounds on the inner product error in SBFP-
and BFP-quantized normally distributed vectors. Second, we
refine those asymptotic results to finite dimensional settings
and derive high-dimensional tight bounds for the same errors.
Third, we utilize the obtained results to introduce a performance measure assessing the typical accuracy of any block format. Such a framework enables us to determine the optimal format parameters, such as the block size, that yield the highest accuracy.
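For intuition about the quantities studied below, a small Monte Carlo experiment along the following lines, built on the illustrative bfp_quantize/sbfp_quantize sketches above, estimates the empirical inner product error of BFP- and SBFP-quantized standard normal vectors; the block size, mantissa width, and trial count are arbitrary choices for illustration.

import numpy as np

def empirical_inner_product_errors(block_size=64, mantissa_bits=4, trials=10000, seed=0):
    # Monte Carlo estimate of the standard deviation of the inner product
    # error for BFP- and SBFP-quantized i.i.d. standard normal blocks.
    rng = np.random.default_rng(seed)
    errs_bfp, errs_sbfp = [], []
    for _ in range(trials):
        x = rng.standard_normal(block_size)
        y = rng.standard_normal(block_size)
        exact = float(np.dot(x, y))
        mx, ex = bfp_quantize(x, mantissa_bits)
        my, ey = bfp_quantize(y, mantissa_bits)
        errs_bfp.append(bfp_inner_product(mx, ex, my, ey) - exact)
        mx, sx = sbfp_quantize(x, mantissa_bits)
        my, sy = sbfp_quantize(y, mantissa_bits)
        errs_sbfp.append(sbfp_inner_product(mx, sx, my, sy) - exact)
    return float(np.std(errs_bfp)), float(np.std(errs_sbfp))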