Block Format Error Bounds
and Optimal Block Size Selection
Ilya Soloveychik, Ilya Lyubomirsky, Xin Wang and Sudeep Bhoja
d-Matrix
Abstract—The amounts of data that need to be transmitted,
processed, and stored by modern deep neural networks have
reached truly enormous volumes in the last few years, calling for
the invention of new paradigms both in hardware and software
development. One of the most promising and rapidly advancing
frontiers here is the creation of new numerical formats. In this
work we focus on the family of block floating point numerical
formats due to their combination of wide dynamic range,
numerical accuracy, and efficient hardware implementation of
inner products using simple integer arithmetic. These formats
are characterized by a block of mantissas with a shared scale
factor. The basic Block Floating Point (BFP) format rounds
the block scales up to the nearest powers of two. Its simple
modification, Scaled BFP (SBFP), stores the same scales in
full precision and thus allows higher accuracy. In this
paper, we study the statistical behavior of both these formats
rigorously. We develop asymptotic bounds on the inner product
error in SBFP- and BFP-quantized normally distributed vectors.
Next, we refine those asymptotic results to finite dimensional
settings and derive high-dimensional tight bounds for the same
errors. Based on the obtained results we introduce a performance
measure assessing accuracy of any block format. This measure
allows us to determine the optimal parameters, such as the block
size, yielding highest accuracy. In particular, we show that if the
precision of the BFP format is fixed at 4 bits, the optimal block
size becomes 64. All theoretical derivations are supported by
numerical experiments and studies on the weights of publicly
available pretrained neural networks.
Index Terms—Deep learning, in-memory compute, block floating point, quantization.
I. INTRODUCTION
In recent years, online services addressing the requests
of billions of customers worldwide have become ubiquitous.
Such customization is usually provided by Deep Neural Net-
works (DNN) deployed at data centers and equipped with
unprecedented computational capabilities. The recently dis-
covered attention paradigm [1] has led to an explosion in the
variety and complexity of modern high-performance DNNs
based on the transformer architecture [1–3]. Transformers
have achieved tremendous success in all Natural Language
Processing (NLP) tasks due to their ability to learn universal
language representations from large volumes of unlabeled text
data and then transfer this knowledge to downstream tasks
[4, 5]. Importantly, the most common operation in transformers
consuming the majority of the compute and power resources
is matrix multiplication [6, 7]. Creating the infrastructure that
will make the multiplication of the involved large matrices
faster and cheaper without impacting the product accuracy
has become the central focus of the leading researchers and
companies working in this field [8–10].
Faster inference and more reliable training of
large DNNs are usually limited by the arithmetic bandwidth of
the underlying hardware [6, 7, 9, 10]. Quite often designers
and operators of transformer-based networks exploit Graphics
Processing Units (GPUs) as the workhorse because they
offer higher arithmetic density per silicon area than Central
Processing Units (CPUs) [6, 7]. Inference and training tasks
in all these cases are usually processed through full precision
floating-point units. However, the computational load required
by modern transformers has reached such enormous volumes
that traditional GPUs cannot fully meet the growing demand,
pushing both accelerators and high performance GPUs towards
narrow arithmetic. As a consequence, unmatched research
efforts have been applied by the engineering community to
replace narrow floating-point with even denser fixed-point rep-
resentations [11–14]. Despite the excellent gains in both speed
and computational density achieved by fixed-point arithmetic,
training with it, or even with half-precision floating-point arithmetic,
has not yet demonstrated a clear advantage due to the limited
dynamic range inherent in such formats [15].
Block Floating Point (BFP) numerical formats have received
renewed interest recently for DNNs inference applications due
to their combination of wide dynamic range, numerical accu-
racy, and efficient hardware implementation of inner products
using simple integer arithmetic [16, 17]. BFP formats are
characterized by a block of mantissas with a shared scale
factor. The simplest implementation has the scale factor as a
power of two, the so-called exponent, in which case the inner
product between two blocks involves multiplying the integer
mantissas and adding the two block exponents.
Another useful version of the BFP format involves using a
floating point scale factor. The latter provides more numerical
accuracy at the expense of slightly increased complexity
of the inner product operation, since we need a floating point
multiplication of the scale factors [18]. In this paper, we
consider both formats, calling the former simply BFP and the
latter SBFP, where the S stands for the floating point scale factor.
Also, for simplicity and without much impact on the result,
in this study we treat the block scales of SBFP as infinite
precision numbers.
The contribution of this article is three-fold. First, we
develop asymptotic bounds on the inner product error in SBFP-
and BFP-quantized normally distributed vectors. Second, we
refine those asymptotic results to finite dimensional settings
and derive high-dimensional tight bounds for the same errors.
Third, we utilize the obtained results to introduce a perfor-
mance measure assessing the typical accuracy of any block
format. Such a framework enables us to determine the optimal
values of the block format parameters providing the highest
possible accuracy. An example of such parameter selection is
the block size tuning, which we study in detail. In particular,
we show that if the precision of the BFP format is fixed
at 4 bits, the optimal block size becomes $n = 64$. All our
theoretical derivations are supported by numerical simulations
and studies on the weights of publicly available pretrained
neural networks. To the best of our knowledge this article is
the first work shedding light on the statistical properties of
block formats through rigorous studies.
The rest of the text is organized as follows. First, we
define the formats and state the problem of bounding the inner
product quantization error in Sections II and III, respectively.
In Section IV we formulate the asymptotic bounds for large
block sizes which provide much insight into the behavior of
the scalar product quantization error. Section V then refines the
asymptotic bounds into high-dimensional bounds which better
suit the needs of small block accuracy analysis and optimal
parameter selection in practice. In Section VI we introduce a
new framework allowing assessment of any block format with
respect to the optimal behavior of SBFP and show how such
a methodology can be used to determine the optimal block
size for any block format and dataset. Section VII provides
extensive numerical simulations using synthetic data as well
as publicly available pretrained neural networks supporting our
theoretical findings and conclusions. The technical details of
the proofs can be found in the Appendix.
II. BLOCK QUANTIZATION
Recent extensive studies of popular DNNs based on the
transformer mechanism have shown that their trained weights
often have comparable amplitudes [16, 17]. This observation
naturally leads to the following simple compression algorithm.
Given a list of weights, we can store some common scaling
factor in high precision and a low-precision integer for each
element of the list. This idea is formalized by the
family of block formats studied in this work.
Definition 1 (Block Format). A block format is a triple of
a block size $n \in \mathbb{N}$, a mantissa precision $p \in \mathbb{N}$, and a
numerical format for the scaling factor $S$. Given such a triple,
all the elements $\{M_i\}_{i=1}^{n}$ of a block are stored as integers, a.k.a.
mantissas, in the range $[-(2^{p-1}-1),\, 2^{p-1}-1]$, and their
values are computed as
$$\{S \cdot M_1, \dots, S \cdot M_n\}. \qquad (1)$$
The block size $n$ and precision $p$ usually play the role of
degrees of freedom or format parameters that can be altered.
Therefore, a block format is determined by the scale factor
numerical format.
A. SBFP Quantization
Let us fix the block size $n$ and mantissa precision $p$ for a
moment. Assume we are given $n$ real numbers $X_1, \dots, X_n \in \mathbb{R}$
that we want to store in a block. Denote their maximal
absolute value by
$$Y = \max_{i=1,\dots,n} |X_i|, \qquad (2)$$
and set
$$\alpha = 2^{p-1} - 1. \qquad (3)$$
Let us rescale the $X_i$ values as
$$Z_i = \frac{\alpha X_i}{Y} \in [-\alpha, \alpha], \quad i = 1, \dots, n. \qquad (4)$$
Next, we round every $Z_i$ to the closest integer and these will
be their corresponding mantissas, $\mathcal{I}[Z_i]$. The Scaled Block
Floating Point (SBFP) representation is now characterized by
the scale $Y$, which is stored in full precision, and $n$ integers
$\mathcal{I}[Z_i],\ i = 1, \dots, n$, stored using $p$ bits each.
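As an illustration of this procedure, the following NumPy sketch quantizes a block into SBFP along the steps (2)-(4); the helper names sbfp_quantize and sbfp_dequantize are ours and are introduced only for this example.

```python
import numpy as np

def sbfp_quantize(x, p):
    """Quantize a 1-D block x into SBFP with p-bit mantissas.

    Returns the full-precision scale Y and the integer mantissas I[Z_i].
    """
    alpha = 2 ** (p - 1) - 1            # eq. (3)
    y = np.max(np.abs(x))               # eq. (2), stored in full precision
    z = alpha * x / y                   # eq. (4), values lie in [-alpha, alpha]
    mantissas = np.rint(z).astype(int)  # round every Z_i to the closest integer
    return y, mantissas

def sbfp_dequantize(y, mantissas, p):
    """Reconstruct the approximate block values from the SBFP representation."""
    alpha = 2 ** (p - 1) - 1
    return (y / alpha) * mantissas

# Example: a block of n = 8 normal values with 4-bit mantissas.
rng = np.random.default_rng(0)
block = rng.normal(size=8)
y, m = sbfp_quantize(block, p=4)
print(np.abs(block - sbfp_dequantize(y, m, p=4)).max())
```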
B. BFP Quantization
For the plain Block Floating Point (BFP) quantization
format, the way we memorize the scale changes. Here, instead
of storing the exact scale $\frac{Y}{\alpha}$ in full precision, we keep its
nearest power of two from above,
$$\frac{Y}{\alpha} \;\longmapsto\; 2^{\left\lceil \log_2 \frac{Y}{\alpha} \right\rceil}, \qquad (5)$$
where on the right-hand side we used the standard notation
for the ceiling function. As a consequence, quite often the
new scale will be significantly larger than the optimal choice,
which might have a detrimental effect on the resulting accuracy,
as we shall see later.
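A matching sketch of the BFP quantizer (again with hypothetical helper names of our own) differs from the SBFP one only in how the scale is stored: the block keeps a single integer exponent instead of a full-precision number.

```python
import numpy as np

def bfp_quantize(x, p):
    """Quantize a 1-D block x into BFP with p-bit mantissas.

    The scale is rounded up to the nearest power of two, eq. (5),
    so only its integer exponent has to be stored.
    """
    alpha = 2 ** (p - 1) - 1
    y = np.max(np.abs(x))
    exponent = int(np.ceil(np.log2(y / alpha)))   # shared block exponent
    scale = 2.0 ** exponent                       # power-of-two scale >= Y / alpha
    mantissas = np.rint(x / scale).astype(int)    # integer mantissas, |M_i| <= alpha
    return exponent, mantissas

def bfp_dequantize(exponent, mantissas):
    """Reconstruct the approximate block values from the BFP representation."""
    return (2.0 ** exponent) * mantissas
```

Because the scale is a power of two, the inner product of two BFP blocks reduces to an integer dot product of the mantissas followed by an addition of the two block exponents, which is what makes the format attractive in hardware.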
III. STATISTICAL MODEL
As mentioned earlier, one of the central problems in
modern hardware development focused on large-scale DNNs
is designing processing units capable of accurate matrix multi-
plication under heavy memory, compute, and power limitations
[6–10]. An inherent part of this endeavor is examining the
accuracy of the product as a function of the numerical format
used to quantize the multipliers. In this work, we focus on
the elementary building block of matrix multiplication: the
scalar product of two vectors. To study the properties of the
SBFP and BFP quantized inner products systematically and
formulate universally meaningful claims, we have to exploit
well-specified populations of weights that would be both use-
ful in practice and amenable to rigorous derivations. We chose
to use normally distributed data. First, normal populations well
approximate real distributions of weights in neural networks
because the latter are usually initialized normally at random
and on average shift only slightly during the training process
[19]. Second, normal distributions are easy to deal with from
both theoretical and practical points of view. Namely, quite
often theoretical results involving normal populations allow
solution by quadrature yielding meaningful expressions that
provide important insights into the nature of the observed
phenomena, like in our case. In addition, normal variables are
easy to simulate due to the large number of existing algorithms
generating normal pseudo-random numbers of good quality.
Assume we have two vectors $\left\{X_i^{(j)}\right\}_{i=1}^{n}$, $j = 1, 2$, of $n$
random variables which are i.i.d. (independent and identically
distributed) normal with mean zero and variance $\sigma^2$ each. After
quantizing them into a block format using the scheme described
in Section II-A above, denote the absolute quantization
error of each element by
$$\Delta Z_i^{(j)} = Z_i^{(j)} - \mathcal{I}\left[Z_i^{(j)}\right], \quad i = 1, \dots, n, \; j = 1, 2. \qquad (6)$$
Note that for each $j$ there exists an index $i^{(j)}$ for which $Z_i^{(j)}$ is
equal to $\alpha_j$ or $-\alpha_j$ and is therefore nonrandom. Note also that
already for moderate values of precision $p_j$, we can assume
without loss of accuracy that $\Delta Z_i^{(j)},\ i \neq i^{(j)},$ are uniformly
distributed within $\left[-\frac{1}{2}, \frac{1}{2}\right]$.
For the SBFP format study, the quantity we shall be focusing
on below is the absolute error of the scalar product of two
blocks,
$$\Delta E_s = \sum_{i} X_i^{(1)} X_i^{(2)} - \frac{Y^{(1)}}{\alpha_1} \frac{Y^{(2)}}{\alpha_2} \sum_{i} \mathcal{I}\left[Z_i^{(1)}\right] \mathcal{I}\left[Z_i^{(2)}\right], \qquad (7)$$
where the subscript $s$ stands for SBFP quantization. An
analogous error for BFP will be denoted as $\Delta E_b$. In this paper
we will be focusing on the study of the statistical properties
of these errors as a function of the block format parameters,
such as the precision $p$ and the block size $n$.
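The statistical behavior of (7) is easy to probe empirically; the following Monte Carlo sketch, which only assumes NumPy and the definitions above, estimates the variance of $\Delta E_s$ for given $n$, $p$, and $\sigma$.

```python
import numpy as np

def sbfp_inner_product_error_var(n, p, sigma, trials=20_000, seed=0):
    """Monte Carlo estimate of var[dE_s] from eq. (7) for i.i.d. N(0, sigma^2) blocks."""
    rng = np.random.default_rng(seed)
    alpha = 2 ** (p - 1) - 1
    errors = np.empty(trials)
    for t in range(trials):
        x1 = rng.normal(0.0, sigma, size=n)
        x2 = rng.normal(0.0, sigma, size=n)
        y1, y2 = np.abs(x1).max(), np.abs(x2).max()
        m1 = np.rint(alpha * x1 / y1)                    # SBFP mantissas of block 1
        m2 = np.rint(alpha * x2 / y2)                    # SBFP mantissas of block 2
        exact = np.dot(x1, x2)
        quantized = (y1 / alpha) * (y2 / alpha) * np.dot(m1, m2)
        errors[t] = exact - quantized                    # dE_s, eq. (7)
    return errors.var()

print(sbfp_inner_product_error_var(n=64, p=4, sigma=1.0))
```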
IV. ASYMPTOTIC BOUNDS
We start by deriving the limiting bounds on the approx-
imation quality of SBFP and BFP formats when the block
size $n$ tends to infinity. Such results will provide a number of
useful insights into the quantization properties of the formats at
hand. Later we refine them for finite block sizes. To rigorously
formulate the results, we will use the following auxiliary definition.
Definition 2 ([20]). We shall say that a real centered random
variable $W$ is sub-Gaussian with variance proxy $\sigma^2 > 0$ if its
moment generating function can be bounded as
$$\mathbb{E}\left[e^{sW}\right] \le e^{\frac{\sigma^2 s^2}{2}}, \quad \forall s \in \mathbb{R}. \qquad (8)$$
As follows from the name, the statistical properties of sub-
Gaussian random variables mimic those of Gaussian ones.
Lemma 4 from the Appendix lists a few such properties that
will be useful in developing the bounds in question. For a
more complete review of sub-Gaussianity we refer the reader
to [20] and the references cited therein.
A. SBFP Quantization
Proposition 1 (Asymptotic SBFP Bound). For $1 \ll \ln n \ll \max_j 2^{p_j}$,
the variance of the error of the inner product of
SBFP-quantized vectors can be bounded by
$$\operatorname{var}\left[\Delta E_s\right] \le \frac{\sigma^4}{8} \left[\frac{1}{2^{2(p_1 - 1)}} + \frac{1}{2^{2(p_2 - 1)}}\right] n \ln\frac{4 n^2}{2\pi \ln(2 n^2)}, \qquad (9)$$
moreover, $\Delta E_s$ is a symmetric sub-Gaussian random variable
with variance proxy $2\operatorname{var}\left[\Delta E_s\right]$ and the tail bound
$$\mathbb{P}\left[\left|\Delta E_s\right| > t\right] \le 2 \exp\left(-\frac{t^2}{4\operatorname{var}\left[\Delta E_s\right]}\right). \qquad (10)$$

Proof. The proof can be found in the Appendix.
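To make the dependence on $n$ and $p_j$ tangible, the right-hand side of (9) can be evaluated directly; the helper below is our own convenience function, not part of the paper.

```python
import numpy as np

def sbfp_variance_bound(n, p1, p2, sigma):
    """Right-hand side of the asymptotic SBFP variance bound (9)."""
    mantissa_term = 1.0 / 2 ** (2 * (p1 - 1)) + 1.0 / 2 ** (2 * (p2 - 1))
    log_term = np.log(4 * n**2 / (2 * np.pi * np.log(2 * n**2)))
    return sigma**4 / 8 * mantissa_term * n * log_term

# Compare with the Monte Carlo estimate sketched in Section III.
print(sbfp_variance_bound(n=64, p1=4, p2=4, sigma=1.0))
```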
Following the same line as above, we can get a similar
asymptotic result for the BFP format as well. For simplicity
of notation we assume $p_1 = p_2$.
Proposition 2 (Asymptotic BFP Bound). If $p_1 = p_2 = p$ and
$1 \ll \ln n \ll 2^{p}$, the variance of the error of the inner product
of BFP-quantized vectors can be bounded by
$$\operatorname{var}\left[\Delta E_b\right] \le \frac{\sigma^2}{4}\, n\, 2^{\,2\left\lceil \log_2\left(\frac{\sigma}{2^{p-1}}\right) + \frac{1}{2} \log_2 \ln\frac{4 n^2}{2\pi \ln(2 n^2)} \right\rceil}, \qquad (11)$$
moreover, $\Delta E_b$ is a symmetric sub-Gaussian random variable
with variance proxy $2\operatorname{var}\left[\Delta E_b\right]$ and the tail bound
$$\mathbb{P}\left[\left|\Delta E_b\right| > t\right] \le 2 \exp\left(-\frac{t^2}{4\operatorname{var}\left[\Delta E_b\right]}\right). \qquad (12)$$

Proof. The proof can be found in the Appendix.
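Analogously, the right-hand side of (11), including the ceiling that produces the jumps discussed below, can be evaluated with a small helper of our own.

```python
import numpy as np

def bfp_variance_bound(n, p, sigma):
    """Right-hand side of the asymptotic BFP variance bound (11)."""
    log_term = np.log(4 * n**2 / (2 * np.pi * np.log(2 * n**2)))
    exponent = np.ceil(np.log2(sigma / 2 ** (p - 1)) + 0.5 * np.log2(log_term))
    return sigma**2 / 4 * n * 2.0 ** (2 * exponent)

# Sweeping the block size reveals jumps wherever the ceiling crosses an integer.
for n in (16, 32, 48, 64, 96, 128):
    print(n, bfp_variance_bound(n, p=4, sigma=1.0))
```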
A few notes are in order here. First, we see that the behavior
of the bounds as functions of the mantissa precision is
similar, namely, both decrease exponentially with $p_j$. This
observation is confirmed by extensive numerical experiments
in Section VII below. Second, the dependency of the variance
on the block size $n$ can be broken into two multiplicative
factors: 1) the linear dependency on $n$ and 2) the logarithmic
dependency on $n$. The linear dependency is due to the
proportional growth of the variance of the error with the
length of the vectors being multiplied. Indeed, summing up
more i.i.d. variables naturally leads to linear growth of the
variance. The logarithmic multiplier is more surprising. In fact,
it comes from the distribution of the maximum of Gaussian
random variables, as explained in the Appendix. Here
we only mention that given $n$ i.i.d. standard normal random
variables, the expected maximum of the absolute values of
these variables grows approximately as $\sqrt{2 \ln n}$. Since the
variance is proportional to the square of the expected block
scales, as explained in (7) above and Lemma 6 below, its
growth with $n$ becomes logarithmic. All these observations are
also verified by the numerical experiments of Section VII-A.
Note also that the sub-Gaussianity of the errors implies their tight
concentration around the mean. This means that the obtained
bounds, although asymptotic, are quite tight.
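As a quick numerical illustration of the $\sqrt{2 \ln n}$ heuristic invoked above (a rough check of our own, not part of the paper's derivations), one can compare the empirical expected maximum of absolute values with the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (16, 64, 256, 1024):
    # Empirical E[max_i |Z_i|] over many blocks versus the sqrt(2 ln n) approximation.
    empirical = np.abs(rng.standard_normal((20_000, n))).max(axis=1).mean()
    print(n, round(empirical, 3), round(np.sqrt(2 * np.log(n)), 3))
```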
It is important to emphasize the presence of the ceiling
function in the expression (11) of the BFP bound. This leads to
jumps in the values of the bound at those block sizes, at which
the expected values of the scales $Y/\alpha$ pass over integer powers
of two. A more detailed discussion supported by numerical
simulations can be found in Section VII-A below.
V. HIGH-DIMENSIONAL BOUNDS
Numerical simulations of Section VII-A show that for finite
block sizes $n$ the bound obtained in Proposition 2 becomes
quite loose. In this section, we provide finite-dimensional
improvements of the variance bounds from Propositions 1
and 2. Such improvements, however, come at a price: here we
can no longer solve the result by quadrature and are forced to
use numerical integration for our simulations in Section VII.
Lemma 1. Let $\Phi$ and $\phi$ denote the cdf and pdf of the standard
normal distribution, respectively, and for some $\sigma^2 > 0$ let $X_i$