Iterative Methods at Lower Precision
Yizhou Chen, Xiaoyun Gong, Xiang Ji
Guest Editor: James Nagy
October 11, 2022
Abstract
Since numbers in the computer are represented with a fixed number of bits, loss of accuracy
during calculation is unavoidable. At high precision where more bits (e.g. 64) are allocated
to each number, round-off errors are typically small. On the other hand, calculating at lower
precision, such as half (16 bits), has the advantage of being much faster. This research focuses on
experimenting with arithmetic at different precision levels for large-scale inverse problems, which
are represented by linear systems with ill-conditioned matrices. We modified the Conjugate
Gradient Method for Least Squares (CGLS) and the Chebyshev Semi-Iterative Method (CS)
with Tikhonov regularization to do arithmetic at lower precision using the MATLAB chop
function, and we ran experiments on applications from image processing and compared their
performance at different precision levels. We concluded that CGLS is a more stable algorithm
but overflows easily due to the computation of inner products, whereas CS is less likely to overflow
but has more erratic convergence behavior. When the noise level is high, CS outperforms
CGLS by being able to run more iterations before overflow occurs; when the noise level is close
to zero, CS appears to be more susceptible to accumulation of round-off errors.
1 Introduction
Most computer processors today use double-precision binary floating point arithmetic, which rep-
resents floating-point numbers with 64 bits. The convention follows from the IEEE-754 standard
established in 1985, specifying the number of bits for the sign, exponent, and mantissa (fraction)
for various floating-point formats including binary16 (half precision), binary32 (single precision),
and binary64 (double precision). The difference between these formats is illustrated in the diagram
below [4], which shows how the available bits are divided among the sign, exponent, and mantissa.
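As a quantitative complement to the bit layouts, the MATLAB snippet below lists the machine epsilon (spacing of floating-point numbers near 1) and the largest finite value of each format. The half precision constants are written out by hand, since base MATLAB has no built-in fp16 type; they are standard binary16 values.

% Machine epsilon and largest finite value for the three IEEE-754
% binary formats discussed above.
eps_double = eps('double')          % 2^-52,  about 2.22e-16
eps_single = eps('single')          % 2^-23,  about 1.19e-7
eps_half   = 2^-10                  % about 9.77e-4 (written by hand)
realmax_double = realmax('double')  % about 1.80e308
realmax_single = realmax('single')  % about 3.40e38
realmax_half   = 65504              % largest finite binary16 value

The narrow half precision range, with a largest finite value of only 65504, is what makes overflow a recurring concern in the experiments later in the paper.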
Because half precision numbers take only a quarter of the memory of their traditional 64-bit
counterparts, computing with them has attracted interest from researchers and manufacturers due
to the potential for significantly reduced computational time. The format accelerates deep learning
training, allows larger models, and reduces latency in gaming, providing players with better
experiences [12]. However, aside from these benefits, half precision also brings one obvious
disadvantage: increased inaccuracy in number representation. One goal of this research project is
to investigate the impact of moving from double precision to low precision arithmetic when solving
large-scale inverse problems.
Because of their wide variety of applications, we specifically used and modified codes for iterative
methods for solving inverse problems to evaluate the performance of low precision algorithms.
Inverse problems arise when outside measurements are used to acquire information about internal
or hidden data [6]: we operate on known outputs containing some error to recover the true input.
For example, X-ray computed tomography is a contactless imaging method for reconstructing tar-
get objects, most commonly, pathologies in the human body. Another example is image deblurring
problems, which occur when the true picture is to be reconstructed from its blurry (and sometimes
noisy) version [2]. These cases can be abstracted as linear systems b = Ax + e, where A is a
large-scale, typically ill-conditioned matrix and b is an output vector contaminated by noise e.
A naive solution to this problem can be obtained by directly solving Ax = b.
However, this naive solution is often corrupted by noise due to the ill-conditioning of the matrix A.
Regularization methods are often needed to balance signal and noise. One approach to regularization
uses the singular value decomposition of A. However, when the matrix A is large, as is
often the case in real-life applications (including our test cases), such direct methods may
be difficult to apply, since decomposing the matrix can be very computationally costly. When
the matrix is sparse, meaning it contains many zero entries, matrix-vector products can be
computed very efficiently. Therefore, for such problems, iterative methods, which
only require matrix-vector products and vector operations, are a more efficient way to solve inverse
problems than direct methods: a direct method requires O(n^3) work for an n×n matrix A, whereas
each iteration of an iterative method involves significantly less than O(n^2) work [13].
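To make the need for regularization concrete: if, for simplicity, A is assumed invertible, the naive solution satisfies
$$x_{\text{naive}} = A^{-1} b = A^{-1}(A x_{\text{true}} + e) = x_{\text{true}} + A^{-1} e,$$
and for an ill-conditioned A the term $A^{-1}e$ can dominate $x_{\text{true}}$ even when the noise $e$ is small. Tikhonov regularization, mentioned in the abstract, addresses this in its standard form (the symbol $\lambda$ here is generic notation, not taken from a later section) by solving the penalized least squares problem
$$\min_x \; \|Ax - b\|_2^2 + \lambda^2 \|x\|_2^2,$$
where the regularization parameter $\lambda > 0$ balances fidelity to the noisy data against the size of the computed solution.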
2 Simulating Low Precision
2.1 Chop
The MATLAB function chop, written by Higham and Pranesh, provides users a way to simulate
low precision numbers and arithmetic. The function can simulate precisions such as fp16 and bfloat16,
as well as custom formats where the user chooses the number of bits in the significand
and the maximum exponent. The function does not create a new data type; instead,
it keeps numbers in double or single precision while giving them the same scale and spacing
they would have in the lower precision: once a number has been rounded to the bit limit of the
target precision, its remaining significand bits are zero. Arithmetic operations are also carried out
in double or single precision, and the result is then rounded to low precision [10].
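As a concrete illustration, a minimal usage sketch is given below. It assumes the options interface described in [10] (an options struct whose format field selects the precision, with 'h' for fp16 and 'b' for bfloat16); the numerical comments state standard fp16 facts rather than anything specific to our codes.

% Minimal sketch of simulating fp16 with chop; the option fields follow
% the interface described by Higham and Pranesh [10].
clear options
options.format = 'h';     % target format: 'h' = fp16, 'b' = bfloat16
options.round  = 1;       % round to nearest
chop([], options);        % store the options for subsequent chop calls

t1 = chop(1/3)            % 1/3 rounded to the nearest fp16 number
t2 = chop(7e4)            % exceeds the fp16 maximum (65504), so gives Inf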
Fully simulating computations in low precision requires users to call chop after each arithmetic
operation. For example, for the operation
a = x + y*z,
the appropriate use of chop is:
a = chop(x + chop(y*z)).
This is burdensome yet unavoidable, so our modified versions of the iterative methods using chop,
which only simulate low precision arithmetic, usually take a long time to run, even though true low
precision arithmetic would run significantly faster on hardware that supports it. However, for
vector and matrix operations there are ways to reduce the number of calls to chop, improving the
efficiency of our codes. For example, for the inner product between vectors x and y, instead of using
the loop:
sum = 0;
for i = 1:vector_length
sum = chop(sum + chop(x(i) * y(i)));
end
we use an element-wise operation at the beginning:
sum = 0;
z = chop(x.*y);   % one vectorized call to chop rounds all n products
for i = 1:vector_length
sum = chop(sum + z(i));
end
so that we reduce 2n calls to chop down to only n + 1 calls, speeding up the computation.
2.2 Blocking
Since we are using floating-point arithmetic, inaccuracy is inevitable in computations. However,
blocking can be used to reduce the error bound. The method breaks a large number of operations
into several smaller pieces, computes them independently, and sums them up. Consider the inner
product between two vectors:
$$\mathbf{x} = [x_1, x_2, x_3, x_4, x_5, x_6], \qquad \mathbf{y} = [y_1, y_2, y_3, y_4, y_5, y_6].$$
Instead of calculating it directly, we could break it into
$$\mathbf{x}_1 = [x_1, x_2, x_3], \quad \mathbf{x}_2 = [x_4, x_5, x_6], \qquad \mathbf{y}_1 = [y_1, y_2, y_3], \quad \mathbf{y}_2 = [y_4, y_5, y_6].$$
Then we calculate $\mathbf{x}_1^T \mathbf{y}_1$ and $\mathbf{x}_2^T \mathbf{y}_2$ and add the two results. The result can have a smaller error than the
direct calculation. The error bound for the inner product $x^T y$ is [8]:
$$|x^T y - \mathrm{fl}(x^T y)| \leq \gamma_n |x|^T |y|,$$
where $\gamma_n = \frac{nu}{1 - nu}$ and $u$ is the unit round-off of floating-point arithmetic (e.g. $9.77 \times 10^{-4}$ for half
precision, $1.19 \times 10^{-7}$ for single precision, and $2.22 \times 10^{-16}$ for double precision). With blocking, however, the error bound is reduced to [8]:
$$|s_n - \hat{s}_n| \leq \gamma_{(\log_2 n) + 1} |x|^T |y|,$$
where $s_n$ is the sum of the per-block results using the true values of $x$ and $y$, whereas $\hat{s}_n$ is the
sum of the per-block results using the floating-point representations of $x$ and $y$. The detailed
proof is in [8, Chapter 3]. In Figure 1 we plot the error against block size for double, single, and
half precision when computing the inner product of two random vectors. For each precision we
computed the inner product 20 times and took the average error, measured as the difference between
the result of our chopped inner product and MATLAB's default double precision computation.
The graph on the left-hand side includes all three precision levels, and we can see that the error for
half precision is the largest, since it has the fewest bits. Focusing on half precision only, the graph on
the right shows that the errors decrease sharply as soon as blocking is introduced. However, the errors
increase again as the block size grows, indicating that an appropriate block size needs to be chosen
carefully. The reason for this increase is that a very large block size has much the same effect as
doing no blocking at all: the whole vector or matrix is placed into the first block rather than being
divided into smaller sections. In our modified codes we chose a block size of 256, appropriate for the
4096 × 4096 matrices we used in the test cases.
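For concreteness, the following MATLAB sketch combines the two ideas from this section: the chopped element-wise product from Section 2.1 and block-wise accumulation. It is an illustration written for this discussion, not the exact code used in our experiments, and it assumes chop has already been configured for the desired precision.

function s = chop_dot_blocked(x, y, block)
% Inner product x'*y simulated in low precision, accumulated in blocks.
% Assumes chop options (e.g. options.format = 'h') have already been set.
n = length(x);
z = chop(x .* y);                 % one vectorized chop for all products
s = 0;
for first = 1:block:n
    last = min(first + block - 1, n);
    t = 0;                        % partial sum for this block
    for i = first:last
        t = chop(t + z(i));
    end
    s = chop(s + t);              % combine the block's partial sum
end
end

For example, chop_dot_blocked(randn(4096,1), randn(4096,1), 256) uses the block size chosen above.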
3 Conjugate Gradient Method for Least Squares
3.1 Method Overview
The Conjugate Gradient (CG) Method was introduced by Hestenes and Stiefel [7]. It is an iterative
method for solving the linear system Ax = b, where A is a symmetric and positive definite matrix.
The CG method can be viewed as solving the optimization problem of minimizing the convex quadratic
function
$$\varphi(x) = \tfrac{1}{2}\, x^T A x - x^T b.$$
The gradient is zero at the minimizer, meaning $\nabla \varphi(x) = Ax - b = 0$, which is exactly the linear
system we are trying to solve.
The CG method is a Krylov subspace method, which means its approximate solutions lie in
Krylov subspaces; in each iteration, x is allowed to explore subspaces of increasing dimension.
An interesting property of CG is that at each iteration, x is updated to the point within the
current subspace at which the A-norm of the error is minimized [13]. If we assume zero round-off error, CG
is guaranteed to converge within a finite number of steps: specifically, if A is an n×n matrix,
the method will find the solution in no more than n steps. Moreover, at each iteration the updated x is
closer to the true solution than the previous x.
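Written out in standard notation (which may differ slightly from that of [13]), with initial guess $x_0$, initial residual $r_0 = b - Ax_0$, and exact solution $x_* = A^{-1}b$, the $k$-th CG iterate satisfies
$$x_k \in x_0 + \mathcal{K}_k(A, r_0), \qquad \mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, Ar_0, \ldots, A^{k-1} r_0\},$$
and
$$\|x_k - x_*\|_A = \min_{x \,\in\, x_0 + \mathcal{K}_k(A, r_0)} \|x - x_*\|_A, \qquad \text{where } \|v\|_A = \sqrt{v^T A v}.$$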
The CGLS algorithm is the least squares version of the CG method, applied to the normal equations
$A^T A x = A^T b$.
One potential problem with this method at low precision is that the calculation of inner products
can easily result in overflow, as illustrated in the experiment section below.
The following algorithm describes CGLS [1].
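For reference, the following is a minimal MATLAB sketch of the standard CGLS iteration in working precision (no chop calls, no stopping rule, no regularization); it follows the usual formulation of CGLS, and the variable names are ours rather than necessarily those of [1].

function x = cgls_sketch(A, b, maxit)
% Minimal CGLS iteration: solves min ||A*x - b||_2 via the normal
% equations A'*A*x = A'*b without ever forming A'*A explicitly.
x = zeros(size(A, 2), 1);   % start from the zero vector
r = b - A*x;                % residual of the least squares system
s = A'*r;                   % residual of the normal equations
p = s;                      % first search direction
gamma = s'*s;               % squared norm of s (an inner product)
for k = 1:maxit
    q = A*p;
    alpha = gamma / (q'*q); % step length (another inner product)
    x = x + alpha*p;
    r = r - alpha*q;
    s = A'*r;
    gamma_new = s'*s;
    beta = gamma_new / gamma;
    gamma = gamma_new;
    p = s + beta*p;         % next search direction
end
end

In our chopped versions every operation above is wrapped in chop, and inner products such as s'*s and q'*q are the quantities most prone to exceeding the half precision range, consistent with the overflow issue noted above.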
3.2 Experiment
With the chop function, we modified the original CGLS method in the IRtools package, which
offers various iterative methods for large-scale, ill-posed inverse problems and a set of test problems
for these iterative methods [3]. Then we used two large-scale, ill-posed inverse problems from the
same package: image deblurring and tomography reconstruction. The first is to reconstruct an
approximation of the true image from the observed blurry version, whereas the second one is to
reconstruct an image from measured projections, which can be obtained, for example, from X-ray
beams. We investigated these two test problems in different sizes, floating-point precision levels,