Iterative Methods at Lower Precision
Yizhou Chen, Xiaoyun Gong, Xiang Ji
Guest Editor: James Nagy
October 11, 2022
Abstract
Since numbers in the computer are represented with a fixed number of bits, loss of accuracy
during calculation is unavoidable. At high precision where more bits (e.g. 64) are allocated
to each number, round-off errors are typically small. On the other hand, calculating at lower
precision, such as half (16 bits), has the advantage of being much faster. This research focuses on
experimenting with arithmetic at different precision levels for large-scale inverse problems, which
are represented by linear systems with ill-conditioned matrices. We modified the Conjugate
Gradient Method for Least Squares (CGLS) and the Chebyshev Semi-Iterative Method (CS)
with Tikhonov regularization to do arithmetic at lower precision using the MATLAB chop
function, and we ran experiments on applications from image processing and compared their
performance at different precision levels. We concluded that CGLS is a more stable algorithm
but overflows easily due to the computation of inner products, whereas CS is less likely to overflow
but has more erratic convergence behavior. When the noise level is high, CS outperforms
CGLS by being able to run more iterations before overflow occurs; when the noise level is close
to zero, CS appears to be more susceptible to accumulation of round-off errors.
1 Introduction
Most computer processors today use double-precision binary floating point arithmetic, which rep-
resents floating-point numbers with 64 bits. The convention follows from the IEEE-754 standard
established in 1985, specifying the number of bits for the sign, exponent, and mantissa (fraction)
for various floating-point formats including binary16 (half precision), binary32 (single precision),
and binary64 (double precision). The difference between these formats is illustrated in the diagram
below [4], which shows how the available bits are divided among the sign, exponent, and mantissa.
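As a quantitative complement to the bit layouts, the MATLAB snippet below lists the machine epsilon (spacing of floating-point numbers near 1) and the largest finite value of each format. The half precision constants are written out by hand, since base MATLAB has no built-in fp16 type; they are standard binary16 values.

% Machine epsilon and largest finite value for the three IEEE-754
% binary formats discussed above.
eps_double = eps('double')          % 2^-52,  about 2.22e-16
eps_single = eps('single')          % 2^-23,  about 1.19e-7
eps_half   = 2^-10                  % about 9.77e-4 (written by hand)
realmax_double = realmax('double')  % about 1.80e308
realmax_single = realmax('single')  % about 3.40e38
realmax_half   = 65504              % largest finite binary16 value

The narrow half precision range, with a largest finite value of only 65504, is what makes overflow a recurring concern in the experiments later in the paper.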
Because half precision numbers take only a quarter of the memory of their traditional 64-bit
counterparts, computing with them has attracted interest from researchers and manufacturers due
to the potential for significantly reduced computational time. The format accelerates deep learning
training, allows larger models, and reduces latency in gaming, providing players with better
experiences [12]. However, aside from these benefits, half precision also brings one obvious
disadvantage: increased inaccuracy in number representation. One goal of this research project is
to investigate the impact of moving from double precision to low precision arithmetic when solving
large-scale inverse problems.
Because of their wide variety of applications, we specifically used and modified codes for iterative
methods for solving inverse problems to evaluate the performance of low precision algorithms.
Inverse problems arise when outside measurements are used to acquire information about internal
or hidden data [6]: we operate on known outputs containing some error to recover the true input.
For example, X-ray computed tomography is a contactless imaging method for reconstructing tar-
get objects, most commonly, pathologies in the human body. Another example is image deblurring
problems, which occur when the true picture is to be reconstructed from its blurry (and sometimes
noisy) version [2]. These cases can be abstracted as linear systems b = Ax + e, where A is a
large-scale, typically ill-conditioned matrix and b is an output vector contaminated by noise e.
A naive solution to this problem can be obtained by directly solving Ax = b.
However, this naive solution is often corrupted by noise due to the ill-conditioning of the matrix A.
Regularization methods are often needed to balance signal and noise. One approach to regularization
uses the singular value decomposition of A. However, when the matrix A is large, as is
often the case in real-life applications (including our test cases), such direct methods may
be difficult to apply, since decomposing the matrix can be very computationally costly. When
the matrix is sparse, meaning it contains many zero entries, matrix-vector products can be
computed very efficiently. Therefore, for such problems, iterative methods, which
only require matrix-vector products and vector operations, are a more efficient way to solve inverse
problems than direct methods: a direct method requires O(n^3) work for an n×n matrix A, whereas
each iteration of an iterative method involves significantly less than O(n^2) work [13].
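To make the need for regularization concrete: if, for simplicity, A is assumed invertible, the naive solution satisfies
$$x_{\text{naive}} = A^{-1} b = A^{-1}(A x_{\text{true}} + e) = x_{\text{true}} + A^{-1} e,$$
and for an ill-conditioned A the term $A^{-1}e$ can dominate $x_{\text{true}}$ even when the noise $e$ is small. Tikhonov regularization, mentioned in the abstract, addresses this in its standard form (the symbol $\lambda$ here is generic notation, not taken from a later section) by solving the penalized least squares problem
$$\min_x \; \|Ax - b\|_2^2 + \lambda^2 \|x\|_2^2,$$
where the regularization parameter $\lambda > 0$ balances fidelity to the noisy data against the size of the computed solution.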
2 Simulating Low Precision
2.1 Chop
The MATLAB function chop, written by Higham and Pranesh, provides users a way to simulate
low precision numbers and arithmetic. The function can simulate precisions such as fp16 and bfloat16,
as well as custom formats where the user chooses the number of bits in the significand
and the maximum exponent. The function does not create a new data type; instead,
it keeps numbers in double or single precision while giving them the same scale and spacing
they would have in the lower precision: once a number has been rounded to the bit limit of the
target precision, its remaining significand bits are zero. Arithmetic operations are also carried out
in double or single precision, and the result is then rounded to low precision [10].
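As a concrete illustration, a minimal usage sketch is given below. It assumes the options interface described in [10] (an options struct whose format field selects the precision, with 'h' for fp16 and 'b' for bfloat16); the numerical comments state standard fp16 facts rather than anything specific to our codes.

% Minimal sketch of simulating fp16 with chop; the option fields follow
% the interface described by Higham and Pranesh [10].
clear options
options.format = 'h';     % target format: 'h' = fp16, 'b' = bfloat16
options.round  = 1;       % round to nearest
chop([], options);        % store the options for subsequent chop calls

t1 = chop(1/3)            % 1/3 rounded to the nearest fp16 number
t2 = chop(7e4)            % exceeds the fp16 maximum (65504), so gives Inf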
Fully simulating computations in low precision requires users to call chop after each arithmetic
operation. For example, for the operation
a = x + y*z,
the appropriate use of chop is:
a = chop(x + chop(y*z)).
This is burdensome yet unavoidable, so our modified versions of the iterative methods using chop,
which only simulate low precision arithmetic, usually take a long time to run, even though true low
precision arithmetic would run significantly faster on hardware that supports it. However, for
vector and matrix operations there are ways to reduce the number of calls to chop, improving the
efficiency of our codes. For example, for the inner product between vectors x and y, instead of using
the loop:
sum = 0;
for i = 1:vector_length
sum = chop(sum + chop(x(i) * y(i)));
end
we use an element-wise operation at the beginning:
sum = 0;
z = chop(x.*y);   % one vectorized call to chop rounds all n products
for i = 1:vector_length
sum = chop(sum + z(i));
end
so that we reduce 2n calls to chop down to only n + 1 calls, speeding up the computation.
2.2 Blocking
Since we are using floating-point arithmetic, inaccuracy is inevitable in computations. However,
blocking can be used to reduce the error bound. The method breaks a large number of operations
into several smaller pieces, computes them independently, and sums them up. Consider the inner
product between two vectors:
$$\mathbf{x} = [x_1, x_2, x_3, x_4, x_5, x_6], \qquad \mathbf{y} = [y_1, y_2, y_3, y_4, y_5, y_6].$$
Instead of calculating it directly, we could break it into
$$\mathbf{x}_1 = [x_1, x_2, x_3], \quad \mathbf{x}_2 = [x_4, x_5, x_6], \qquad \mathbf{y}_1 = [y_1, y_2, y_3], \quad \mathbf{y}_2 = [y_4, y_5, y_6].$$
Then we calculate $\mathbf{x}_1^T \mathbf{y}_1$ and $\mathbf{x}_2^T \mathbf{y}_2$ and add the two results. The result can have a smaller error than the
direct calculation. The error bound for the inner product $x^T y$ is [8]:
$$|x^T y - \mathrm{fl}(x^T y)| \leq \gamma_n |x|^T |y|,$$
where $\gamma_n = \frac{nu}{1 - nu}$ and $u$ is the unit round-off of floating-point arithmetic (e.g. $9.77 \times 10^{-4}$ for half
precision, $1.19 \times 10^{-7}$ for single precision, and $2.22 \times 10^{-16}$ for double precision). With blocking, however, the error bound is reduced to [8]:
$$|s_n - \hat{s}_n| \leq \gamma_{(\log_2 n) + 1} |x|^T |y|,$$
where $s_n$ is the sum of the per-block results using the true values of $x$ and $y$, whereas $\hat{s}_n$ is the
sum of the per-block results using the floating-point representations of $x$ and $y$. The detailed
proof is in [8, Chapter 3]. In Figure 1 we plot the error against block size for double, single, and
half precision when computing the inner product of two random vectors. For each precision we
computed the inner product 20 times and took the average error, measured as the difference between
the result of our chopped inner product and MATLAB's default double precision computation.
The graph on the left-hand side includes all three precision levels, and we can see that the error for
half precision is the largest, since it has the fewest bits. Focusing on half precision only, the graph on
the right shows that the errors decrease sharply as soon as blocking is introduced. However, the errors
increase again as the block size grows, indicating that an appropriate block size needs to be chosen
carefully. The reason for this increase is that a very large block size has much the same effect as
doing no blocking at all: the whole vector or matrix is placed into the first block rather than being
divided into smaller sections. In our modified codes we chose a block size of 256, appropriate for the
4096 × 4096 matrices we used in the test cases.
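For concreteness, the following MATLAB sketch combines the two ideas from this section: the chopped element-wise product from Section 2.1 and block-wise accumulation. It is an illustration written for this discussion, not the exact code used in our experiments, and it assumes chop has already been configured for the desired precision.

function s = chop_dot_blocked(x, y, block)
% Inner product x'*y simulated in low precision, accumulated in blocks.
% Assumes chop options (e.g. options.format = 'h') have already been set.
n = length(x);
z = chop(x .* y);                 % one vectorized chop for all products
s = 0;
for first = 1:block:n
    last = min(first + block - 1, n);
    t = 0;                        % partial sum for this block
    for i = first:last
        t = chop(t + z(i));
    end
    s = chop(s + t);              % combine the block's partial sum
end
end

For example, chop_dot_blocked(randn(4096,1), randn(4096,1), 256) uses the block size chosen above.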
3 Conjugate Gradient Method for Least Squares
3.1 Method Overview
The Conjugate Gradient (CG) Method was introduced by Hestenes and Stiefel [7]. It is an iterative
method for solving the linear system Ax = b, where A is a symmetric and positive definite matrix.
The CG method can be viewed as solving the optimization problem of minimizing the convex quadratic
function
$$\varphi(x) = \tfrac{1}{2}\, x^T A x - x^T b.$$
The gradient is zero at the minimizer, meaning $\nabla \varphi(x) = Ax - b = 0$, which is exactly the linear
system we are trying to solve.
The CG method is a Krylov subspace method, which means its approximate solutions lie in
Krylov subspaces; in each iteration, x is allowed to explore subspaces of increasing dimension.
An interesting property of CG is that at each iteration, x is updated to the point within the
current subspace at which the A-norm of the error is minimized [13]. If we assume zero round-off error, CG
is guaranteed to converge within a finite number of steps: specifically, if A is an n×n matrix,
the method will find the solution in no more than n steps. Moreover, at each iteration the updated x is
closer to the true solution than the previous x.
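Written out in standard notation (which may differ slightly from that of [13]), with initial guess $x_0$, initial residual $r_0 = b - Ax_0$, and exact solution $x_* = A^{-1}b$, the $k$-th CG iterate satisfies
$$x_k \in x_0 + \mathcal{K}_k(A, r_0), \qquad \mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, Ar_0, \ldots, A^{k-1} r_0\},$$
and
$$\|x_k - x_*\|_A = \min_{x \,\in\, x_0 + \mathcal{K}_k(A, r_0)} \|x - x_*\|_A, \qquad \text{where } \|v\|_A = \sqrt{v^T A v}.$$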
The CGLS algorithm is the least squares version of the CG method, applied to the normal equations
$A^T A x = A^T b$.
One potential problem with this method at low precision is that the calculation of inner products
can easily result in overflow, as illustrated in the experiment section below.
The following algorithm describes CGLS [1].
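For reference, the following is a minimal MATLAB sketch of the standard CGLS iteration in working precision (no chop calls, no stopping rule, no regularization); it follows the usual formulation of CGLS, and the variable names are ours rather than necessarily those of [1].

function x = cgls_sketch(A, b, maxit)
% Minimal CGLS iteration: solves min ||A*x - b||_2 via the normal
% equations A'*A*x = A'*b without ever forming A'*A explicitly.
x = zeros(size(A, 2), 1);   % start from the zero vector
r = b - A*x;                % residual of the least squares system
s = A'*r;                   % residual of the normal equations
p = s;                      % first search direction
gamma = s'*s;               % squared norm of s (an inner product)
for k = 1:maxit
    q = A*p;
    alpha = gamma / (q'*q); % step length (another inner product)
    x = x + alpha*p;
    r = r - alpha*q;
    s = A'*r;
    gamma_new = s'*s;
    beta = gamma_new / gamma;
    gamma = gamma_new;
    p = s + beta*p;         % next search direction
end
end

In our chopped versions every operation above is wrapped in chop, and inner products such as s'*s and q'*q are the quantities most prone to exceeding the half precision range, consistent with the overflow issue noted above.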
3.2 Experiment
With the chop function, we modified the original CGLS method in the IRtools package, which
offers various iterative methods for large-scale, ill-posed inverse problems and a set of test problems
for these iterative methods [3]. Then we used two large-scale, ill-posed inverse problems from the
same package: image deblurring and tomography reconstruction. The first is to reconstruct an
approximation of the true image from the observed blurry version, whereas the second one is to
reconstruct an image from measured projections, which can be obtained, for example, from X-ray
beams. We investigated these two test problems in different sizes, floating-point precision levels,