Block Format Error Bounds
and Optimal Block Size Selection
Ilya Soloveychik, Ilya Lyubomirsky, Xin Wang and Sudeep Bhoja
d-Matrix
Abstract—The amounts of data that modern deep neural networks need to transmit, process, and store have reached truly enormous volumes in the last few years, calling for the invention of new paradigms in both hardware and software development. One of the most promising and rapidly advancing
frontiers here is the creation of new numerical formats. In this
work we focus on the family of block floating point numerical
formats due to their combination of wide dynamic range,
numerical accuracy, and efficient hardware implementation of
inner products using simple integer arithmetic. These formats
are characterized by a block of mantissas with a shared scale
factor. The basic Block Floating Point (BFP) format quantizes the block scales to the nearest powers of two from above, i.e., rounds them up. Its simple modification, Scaled BFP (SBFP), stores the same scales in full precision and thus allows higher accuracy. In this
paper, we study the statistical behavior of both these formats
rigorously. We develop asymptotic bounds on the inner product
error in SBFP- and BFP-quantized normally distributed vectors.
Next, we refine those asymptotic results to finite dimensional
settings and derive high-dimensional tight bounds for the same
errors. Based on the obtained results, we introduce a performance measure assessing the accuracy of any block format. This measure allows us to determine the optimal parameters, such as the block size, yielding the highest accuracy. In particular, we show that if the precision of the BFP format is fixed at 4 bits, the optimal block size becomes 64. All theoretical derivations are supported by
numerical experiments and studies on the weights of publicly
available pretrained neural networks.
Index Terms—Deep learning, in-memory compute, block float-
ing point, quantization.
I. INTRODUCTION
In recent years, online services addressing the requests of billions of customers worldwide have become ubiquitous. Such customization is usually provided by Deep Neural Networks (DNNs) deployed at data centers and equipped with unprecedented computational capabilities. The recently discovered attention paradigm [1] has led to an explosion in the variety and complexity of modern high-performance DNNs based on the transformer architecture [1–3]. Transformers
have achieved tremendous success in all Natural Language
Processing (NLP) tasks due to their ability to learn universal
language representations from large volumes of unlabeled text
data and then transfer this knowledge to downstream tasks
[4, 5]. Importantly, the most common operation in transformers, consuming the majority of the compute and power resources, is matrix multiplication [6, 7]. Creating infrastructure that makes the multiplication of the involved large matrices faster and cheaper without impacting the accuracy of the product has become the central focus of leading researchers and companies working in this field [8–10].
Enabling faster inference and more reliable training of large DNNs is usually limited by the arithmetic bandwidth of the underlying hardware [6, 7, 9, 10]. Quite often, designers and operators of transformer-based networks exploit Graphics Processing Units (GPUs) as the workhorse because they offer higher arithmetic density per silicon area than Central Processing Units (CPUs) [6, 7]. Inference and training tasks in all these cases are usually processed through full-precision floating-point units. However, the computational load required by modern transformers has reached such enormous volumes that traditional GPUs cannot fully meet the growing demand, pushing both accelerators and high-performance GPUs towards narrow arithmetic. As a consequence, the engineering community has applied unmatched research efforts to replacing narrow floating-point with even denser fixed-point representations [11–14]. Despite the excellent gains in both speed and computational density achieved by fixed-point arithmetic, training with it, or even with half-precision floating-point arithmetic, has not provided clear evidence in its favor due to the limited dynamic range inherent in such formats [15].
Block Floating Point (BFP) numerical formats have recently received renewed interest for DNN inference applications due to their combination of wide dynamic range, numerical accuracy, and efficient hardware implementation of inner products using simple integer arithmetic [16, 17]. BFP formats are
characterized by a block of mantissas with a shared scale
factor. The simplest implementation has the scale factor as a
power of two, the so-called exponent, in which case the inner
product between two blocks involves multiplying the integer
mantissas and adding the two block exponents.
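To make this arithmetic concrete, the following is a minimal NumPy sketch of BFP block quantization and of a block inner product. The mantissa width, rounding, and saturation conventions shown here are illustrative assumptions and not necessarily the exact scheme analyzed later in the paper.

import numpy as np

def bfp_quantize(block, mantissa_bits=4):
    # Quantize a 1-D block into signed integer mantissas sharing one
    # power-of-two scale (the block exponent). Conventions are illustrative.
    max_abs = np.max(np.abs(block))
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Shared exponent chosen so that the largest magnitude roughly fills
    # the signed mantissa range [-(2^(m-1) - 1), 2^(m-1) - 1].
    exp = int(np.ceil(np.log2(max_abs))) - (mantissa_bits - 1)
    q_max = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / 2.0 ** exp), -q_max, q_max).astype(np.int32)
    return mantissas, exp

def bfp_inner_product(m1, e1, m2, e2):
    # Integer multiply-accumulate of the mantissas, then a single shift
    # by the sum of the two block exponents.
    return int(np.dot(m1, m2)) * 2.0 ** (e1 + e2)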
Another useful version of the BFP format involves using a floating-point scale factor. The latter provides more numerical accuracy at the expense of slightly increased complexity of the inner product operation, since a floating-point multiplication of the scale factors is required [18]. In this paper, we consider both formats, calling the former simply BFP and the latter SBFP, where the S stands for the floating-point scale factor. Also, for simplicity and without much impact on the results, in this study we treat the block scales of SBFP as infinite-precision numbers.
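A corresponding sketch of the SBFP variant, under the same illustrative assumptions as above, keeps identical integer mantissas but stores the shared scale as a full-precision float, so a block inner product requires one floating-point multiplication of the two scales.

import numpy as np

def sbfp_quantize(block, mantissa_bits=4):
    # Same integer mantissas as in the BFP sketch, but the shared scale is
    # kept in full precision instead of being rounded to a power of two.
    max_abs = np.max(np.abs(block))
    q_max = 2 ** (mantissa_bits - 1) - 1
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 1.0
    scale = max_abs / q_max  # full-precision (here float64) scale factor
    mantissas = np.round(block / scale).astype(np.int32)
    return mantissas, scale

def sbfp_inner_product(m1, s1, m2, s2):
    # Integer multiply-accumulate plus a single floating-point
    # multiplication of the two block scales.
    return int(np.dot(m1, m2)) * (s1 * s2)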
The contribution of this article is three-fold. First, we
develop asymptotic bounds on the inner product error in SBFP-
and BFP-quantized normally distributed vectors. Second, we
refine those asymptotic results to finite dimensional settings
and derive high-dimensional tight bounds for the same errors.
Third, we utilize the obtained results to introduce a performance measure assessing the typical accuracy of any block format. Such a framework enables us to determine the optimal format parameters, such as the block size, that yield the highest accuracy.
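For intuition about the quantities studied below, a small Monte Carlo experiment along the following lines, built on the illustrative bfp_quantize/sbfp_quantize sketches above, estimates the empirical inner product error of BFP- and SBFP-quantized standard normal vectors; the block size, mantissa width, and trial count are arbitrary choices for illustration.

import numpy as np

def empirical_inner_product_errors(block_size=64, mantissa_bits=4, trials=10000, seed=0):
    # Monte Carlo estimate of the standard deviation of the inner product
    # error for BFP- and SBFP-quantized i.i.d. standard normal blocks.
    rng = np.random.default_rng(seed)
    errs_bfp, errs_sbfp = [], []
    for _ in range(trials):
        x = rng.standard_normal(block_size)
        y = rng.standard_normal(block_size)
        exact = float(np.dot(x, y))
        mx, ex = bfp_quantize(x, mantissa_bits)
        my, ey = bfp_quantize(y, mantissa_bits)
        errs_bfp.append(bfp_inner_product(mx, ex, my, ey) - exact)
        mx, sx = sbfp_quantize(x, mantissa_bits)
        my, sy = sbfp_quantize(y, mantissa_bits)
        errs_sbfp.append(sbfp_inner_product(mx, sx, my, sy) - exact)
    return float(np.std(errs_bfp)), float(np.std(errs_sbfp))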