
where all terms are positive except possibly the sampling luck, which is zero on average, has a
standard deviation shrinking with data size $|D|$ according to the Poisson scaling $|D|^{-1/2}$, and will
be ignored in the present paper. The generalization gap has been extensively studied in prior work,
so this paper will focus exclusively on the optimization error and the architecture error.
To summarize: the architecture error is the best possible performance that a given architecture
can achieve on the task; the generalization gap is the difference between the optimal performance
on the training set $D$ and the architecture error; and the optimization error is the error introduced
by imperfect optimization, that is, the difference between the training-set error found by imperfect
optimization and the optimal error on the training set. When comparing methods and studying their
scaling, it is useful to ask which of these error sources dominates. We will see that both architecture
error and optimization error can be quite important in the high-precision regime, as we elaborate
in Sections 2-3 and Section 4, respectively.
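Schematically, and using illustrative symbols rather than the paper's own notation, the decomposition described above can be restated as

$$\ell_{\rm total} \;=\; \underbrace{\ell_{\rm arch}}_{\text{architecture error}} \;+\; \underbrace{\ell_{\rm gen}}_{\text{generalization gap}} \;+\; \underbrace{\ell_{\rm opt}}_{\text{optimization error}} \;+\; \underbrace{\ell_{\rm luck}}_{\text{sampling luck}},$$

where all terms except the sampling luck are non-negative, and the sampling luck vanishes in expectation with standard deviation $O(|D|^{-1/2})$.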
1.3 Importance of Scaling Exponents
In this work, one property that we focus on is how methods scale as we increase parameters or
training data. This builds on a recent body of work on scaling laws in deep learning [8, 9, 10, 11,
12, 13, 14, 15, 16], which has found that, on many tasks, loss decreases predictably as a power law
in the number of model parameters and the amount of training data. Attempting to understand this
scaling behavior, [17, 18] argue that in some regimes, cross-entropy and MSE loss should scale as
$N^{-\alpha}$, where $\alpha \gtrsim 4/d$, $N$ is the number of model parameters, and $d$ is the intrinsic dimensionality
of the data manifold of the task.
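As a concrete illustration (ours, not an experiment from this paper), a scaling exponent of this kind is typically estimated by fitting a line to empirical $(N, \mathrm{loss})$ pairs in log-log space; the measurements below are synthetic, generated with a known exponent purely to show the procedure:

import numpy as np

# Synthetic (N, loss) measurements; in practice these would come from training
# runs at several model sizes. Here loss ∝ N^(-0.7), with small multiplicative noise.
rng = np.random.default_rng(0)
N = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
loss = 5.0 * N ** (-0.7) * np.exp(0.05 * rng.standard_normal(len(N)))

# A power law loss = a * N^(-alpha) is linear in log-log space:
# log(loss) = log(a) - alpha * log(N), so the fitted slope gives -alpha.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(f"estimated scaling exponent alpha ≈ {-slope:.2f}")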
Consider the problem of approximating some analytic function $f: [0,1]^d \to \mathbb{R}$ with a function
which is a piecewise $n$-degree polynomial. If one partitions a hypercube in $\mathbb{R}^d$ into regions of side length $\epsilon$
and approximates $f$ as an $n$-degree polynomial in each region (requiring $N = O(1/\epsilon^d)$ parameters),
the absolute error in each region will be $O(\epsilon^{n+1})$ (given by the degree-$(n+1)$ term in the Taylor expansion
of $f$), and so absolute error scales as $N^{-\frac{n+1}{d}}$. If neural networks use ReLU activations, they are
piecewise linear, so $n = 1$ and we may expect $\ell_{\rm rmse}(N) \propto N^{-2/d}$. However, in line with [17], we
find that ReLU NNs often scale as if the problem were lower-dimensional than the input dimension,
though we suggest that this is a result of the computational modularity of the problems in our
setting, rather than a matter of low intrinsic dimensionality (though these perspectives are related).
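As a quick numerical sanity check of the Taylor-expansion argument above (our own sketch, not an experiment from the paper), piecewise-linear interpolation of a smooth one-dimensional function should have RMSE falling as $N^{-2}$, i.e. roughly a factor of four per doubling of $N$:

import numpy as np

# Piecewise-linear interpolation of an analytic function in d = 1.
# The argument above predicts rmse ∝ N^{-(n+1)/d} = N^{-2} for n = 1, d = 1.
f = lambda x: np.sin(2 * np.pi * x)
x_test = np.linspace(0.0, 1.0, 100_000)

for N in [8, 16, 32, 64, 128]:
    knots = np.linspace(0.0, 1.0, N)            # N "parameters": function values at the knots
    y_hat = np.interp(x_test, knots, f(knots))  # piecewise-linear fit, analogous to a ReLU network
    rmse = np.sqrt(np.mean((y_hat - f(x_test)) ** 2))
    print(f"N = {N:4d}   rmse = {rmse:.3e}")    # should drop ~4x per doubling of N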
If one desires very low loss, then the exponent $\alpha$, the rate at which methods approach their
best possible performance¹, matters a great deal. Kaplan et al. [17] note that $4/d$ is merely a lower
bound on the scaling rate; we consider ways that neural networks can improve on this bound.
Understanding model scaling is key to understanding the feasibility of achieving high precision.
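A back-of-the-envelope consequence of the power-law form above (our illustration): if $\ell \propto N^{-\alpha}$, then reaching a target loss $\ell_{\rm target}$ requires

$$N \propto \ell_{\rm target}^{-1/\alpha},$$

so lowering the target loss by a factor of $100$ multiplies the required parameter count by $100^{1/\alpha}$: a factor of $10$ when $\alpha = 2$, but $10^{4}$ when $\alpha = 1/2$.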
1.4 Organization
This paper is organized as follows. In Section 2 we discuss piecewise linear approximation methods,
comparing ReLU networks with linear simplex interpolation. We find that neural networks can
sometimes outperform simplex interpolation, and suggest that they do this by discovering modular
structure in the data. In Section 3 we discuss nonlinear methods, including neural networks with
nonlinear activation functions. In Section 4 we discuss the optimization challenges of high-precision
neural network training: how optimization difficulties can often make total error far worse than the
limit set by the architecture error. We attempt to develop optimization methods for overcoming
these problems and describe their limitations, then conclude in Section 5.
¹The best possible performance can be determined either by precision limits or by noise intrinsic to the problem,
such as the intrinsic entropy of natural language.