Precision Machine Learning
Eric J. Michaud1, 2, Ziming Liu1, 2, and Max Tegmark1, 2, 3
1Department of Physics, MIT
2NSF AI Institute for AI and Fundamental Interactions
3Center for Brains, Minds and Machines
October 24, 2022
Abstract
We explore unique considerations involved in fitting ML models to data with very high precision, as is often required for science applications. We empirically compare various function approximation methods and study how they scale with increasing parameters and data. We find that neural networks can often outperform classical approximation methods on high-dimensional examples, by auto-discovering and exploiting modular structures therein. However, neural networks trained with common optimizers are less powerful for low-dimensional cases, which motivates us to study the unique properties of neural network loss landscapes and the corresponding optimization challenges that arise in the high-precision regime. To address the optimization issue in low dimensions, we develop training tricks which enable us to train neural networks to extremely low loss, close to the limits allowed by numerical precision.
1 Introduction
Most machine learning practitioners do not need to fit their data with much precision. When
applying machine learning to traditional AI tasks such as computer vision or natural language
processing, one typically does not desire to bring training loss all the way down to exactly zero, in
part because training loss is just a proxy for some other performance measure like accuracy that
one actually cares about, or because there is intrinsic uncertainty which makes perfect prediction
impossible, e.g., for language modeling. Accordingly, to save memory and speed up computation,
much work has gone into reducing the numerical precision used in models without sacrificing model
performance much [1,2,3]. However, modern machine learning methods, and deep neural networks
in particular, are now increasingly being applied to science problems, for which being able to fit
models very precisely to (high-quality) data can be important. Small absolute changes in loss can
make a big difference, e.g., for the symbolic regression task of identifying an exact formula from
data.
It is therefore timely to consider what, if any, unique considerations arise when attempting to fit
ML models very precisely to data, a regime we call Precision Machine Learning (PML). How does
pursuit of precision affect choice of method? How does optimization change in the high-precision
regime? Do otherwise-obscure properties of model expressivity or optimization come into focus when
one cares a great deal about precision? In this paper, we explore these basic questions.
1.1 Problem Setting
We study regression in the setting of supervised learning, in particular the task of fitting functions
$f: \mathbb{R}^d \to \mathbb{R}$ to a dataset $D = \{(\vec{x}_i,\, y_i = f(\vec{x}_i))\}_{i=1}^{|D|}$. In this work, we mostly restrict our focus to
functions $f$ which are given by symbolic formulas. Such functions are appropriate for our purpose of
studying precision machine learning for science applications, since they (1) are ubiquitous in science,
fundamental to many fields’ descriptions of nature, (2) are precise, not introducing any intrinsic
noise in the data, making extreme precision possible, and (3) often have interesting structure such
as modularity that sufficiently clever ML methods should be able to discover and exploit. We use a
dataset of symbolic formulas from [4], collected from the Feynman Lectures on Physics [5].
Just how closely can we expect to fit models to data? When comparing a model prediction $f_\theta(\vec{x}_i)$
to a data point $y_i$, the smallest nonzero difference allowed is determined by the numerical precision
used. IEEE 754 64-bit floats [6] have 52 mantissa bits, so if $y_i$ and $f_\theta(\vec{x}_i)$ are of order unity, then the
smallest nonzero difference between them is $\epsilon_0 = 2^{-52} \approx 10^{-16}$. We should not expect to achieve
relative RMSE loss below $10^{-16}$, where the relative RMSE loss on a dataset $D$ is:
$$
\ell_{\rm rms} \equiv \left( \frac{\sum_{i=1}^{|D|} |f_\theta(\vec{x}_i) - y_i|^2}{\sum_{i=1}^{|D|} y_i^2} \right)^{1/2} = \frac{|f_\theta(\vec{x}_i) - y_i|_{\rm rms}}{y_{\rm rms}}. \tag{1}
$$
In practice, precision can be bottlenecked earlier by the computations performed within the model
$f_\theta$. The task of precision machine learning is to try to push the loss down many orders of magnitude,
driving $\ell_{\rm rms}$ as close as possible to the numerical noise floor $\epsilon_0$.
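As a concrete illustration (ours, not from the paper), the following numpy sketch evaluates the relative RMSE loss of Eq. (1) for models that deviate from an exact symbolic target by controlled relative perturbations, and shows how the float64 noise floor $\epsilon_0 = 2^{-52}$ and lower-precision arithmetic inside the model limit the achievable loss. The target formula and perturbation sizes are arbitrary choices.

```python
import numpy as np

def relative_rmse(y_pred, y_true):
    """Relative RMSE loss of Eq. (1): RMS error normalized by the RMS of the targets."""
    return np.sqrt(np.sum((y_pred - y_true) ** 2) / np.sum(y_true ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(size=(10_000, 2))
y = np.sin(X[:, 0]) * np.exp(X[:, 1])       # exact targets, order unity

# Models that match the target up to a tiny relative perturbation.
# (1 + 1e-16 rounds to exactly 1.0 in float64, so the last loss is 0: the
# perturbation falls below the noise floor and is not even representable.)
for eps in [1e-4, 1e-8, 1e-12, 1e-16]:
    print(f"perturbation {eps:.0e}:  l_rms = {relative_rmse(y * (1 + eps), y):.2e}")

print(f"noise floor eps_0 = 2**-52 = {2.0 ** -52:.2e}")

# Precision can also be bottlenecked by arithmetic inside the model itself,
# e.g. evaluating the same formula in float32:
y32 = np.sin(X[:, 0].astype(np.float32)) * np.exp(X[:, 1].astype(np.float32))
print(f"same formula in float32:  l_rms = {relative_rmse(y32.astype(np.float64), y):.2e}")
```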
1.2 Decomposition of Loss
One can similarly define the relative MSE loss $\ell_{\rm mse} \equiv \ell_{\rm rms}^2$, as well as the non-relative (standard) MSE
loss $L_{\rm mse}(f) = \frac{1}{|D|}\sum_{i=1}^{|D|} (f_\theta(\vec{x}_i) - y_i)^2$ and $L_{\rm rms} = \sqrt{L_{\rm mse}}$. Minimizing $\ell_{\rm rms}$, $\ell_{\rm mse}$, $L_{\rm rms}$, and $L_{\rm mse}$ is
equivalent up to numerical errors. Note that the (relative) expected loss can be defined on a probability
distribution $P$ on $\mathbb{R}^d \times \mathbb{R}$, like so:
$$
\ell^P_{\rm rms} = \left( \frac{\mathbb{E}_{(\vec{x}, y) \sim P}\left[(f_\theta(\vec{x}) - y)^2\right]}{\mathbb{E}_{(\vec{x}, y) \sim P}\left[y^2\right]} \right)^{1/2}. \tag{2}
$$
When we wish to emphasize the distinction between loss on a dataset $D$ (empirical loss) and on a
distribution $P$ (expected loss), we write $\ell^D$ and $\ell^P$. In the spirit of [7], we find it useful to decompose
the error into different sources, which we term optimization error, sampling luck, the generalization
gap, and architecture error. A given model architecture parametrizes a set of expressible functions $\mathcal{H}$.
One can define three functions of interest within $\mathcal{H}$:
$$
f^{\rm best}_P \equiv \underset{f \in \mathcal{H}}{\operatorname{argmin}} \, \{\ell^P(f)\}, \tag{3}
$$
the best model on the expected loss $\ell^P$,
$$
f^{\rm best}_D \equiv \underset{f \in \mathcal{H}}{\operatorname{argmin}} \, \{\ell^D(f)\}, \tag{4}
$$
the best model on the empirical loss $\ell^D$, and
$$
f^{\rm used}_D = \mathcal{A}(\mathcal{H}, D, L), \tag{5}
$$
the model found by a given learning algorithm $\mathcal{A}$, which performs possibly imperfect optimization
to minimize the empirical loss $L$ on $D$.
We can therefore decompose the empirical loss as follows:
$$
\ell^D(f^{\rm used}_D) = \underbrace{\left[\ell^D(f^{\rm used}_D) - \ell^D(f^{\rm best}_D)\right]}_{\text{optimization error}} + \underbrace{\left[\ell^D(f^{\rm best}_D) - \ell^P(f^{\rm best}_D)\right]}_{\text{sampling luck}} + \underbrace{\left[\ell^P(f^{\rm best}_D) - \ell^P(f^{\rm best}_P)\right]}_{\text{generalization gap}} + \underbrace{\ell^P(f^{\rm best}_P)}_{\text{architecture error}}, \tag{6}
$$
where all terms are positive except possibly the sampling luck, which is zero on average, has a
standard deviation shrinking with data size $|D|$ according to the Poisson scaling $|D|^{-1/2}$, and will
be ignored in the present paper. The generalization gap has been extensively studied in prior work,
so this paper will focus exclusively on the optimization error and the architecture error.
To summarize: the architecture error is the best possible performance that a given architecture
can achieve on the task, the generalization gap is the difference between the expected loss of the model
that is optimal on the training set $D$ and the architecture error, and the optimization error is the error
introduced by imperfect optimization, i.e., the difference between the error on the training set found by
imperfect optimization and the optimal error on the training set. When comparing methods and studying their
scaling, it is useful to ask which of these error sources dominates. We will see that both architecture
error and optimization error can be quite important in the high-precision regime, as we will elaborate
on in Sections 2-3 and Section 4, respectively.
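The decomposition in Eq. (6) telescopes exactly, which can be checked numerically. The sketch below (our illustration, not an experiment from the paper) estimates the four terms in a toy setting: the hypothesis class $\mathcal{H}$ is cubic polynomials, $f^{\rm best}_D$ and a proxy for $f^{\rm best}_P$ are obtained by least squares (on $D$ and on a large sample standing in for $P$), and $f^{\rm used}_D$ comes from a deliberately short run of gradient descent. The target function, sample sizes, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # noiseless target function

def l_rms(w, x, y):
    """Relative RMSE (Eq. 1) of a cubic-polynomial model with coefficients w."""
    pred = np.polyval(w, x)
    return np.sqrt(np.sum((pred - y) ** 2) / np.sum(y ** 2))

# Hypothesis class H: cubic polynomials w3*x^3 + w2*x^2 + w1*x + w0.
x_D = rng.uniform(size=256)                    # training set D
y_D = f(x_D)
x_P = rng.uniform(size=200_000)                # large sample standing in for P
y_P = f(x_P)

# f_best_D: exact least squares on D (perfect optimization of the empirical loss).
w_best_D = np.polyfit(x_D, y_D, deg=3)
# f_best_P: least squares on the large sample, as a proxy for the expected-loss optimum.
w_best_P = np.polyfit(x_P, y_P, deg=3)

# f_used: the same class trained by a deliberately short run of full-batch gradient descent.
w_used = np.zeros(4)
A = np.vander(x_D, 4)                          # design matrix, columns [x^3, x^2, x, 1]
for _ in range(2000):
    grad = 2 * A.T @ (A @ w_used - y_D) / len(x_D)
    w_used -= 0.1 * grad

opt_err  = l_rms(w_used, x_D, y_D)   - l_rms(w_best_D, x_D, y_D)
luck     = l_rms(w_best_D, x_D, y_D) - l_rms(w_best_D, x_P, y_P)
gen_gap  = l_rms(w_best_D, x_P, y_P) - l_rms(w_best_P, x_P, y_P)
arch_err = l_rms(w_best_P, x_P, y_P)
print(f"optimization error: {opt_err:.2e}")
print(f"sampling luck:      {luck:+.2e}")
print(f"generalization gap: {gen_gap:.2e}")
print(f"architecture error: {arch_err:.2e}")
print(f"sum = {opt_err + luck + gen_gap + arch_err:.2e}  "
      f"vs  l^D(f_used) = {l_rms(w_used, x_D, y_D):.2e}")
```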
1.3 Importance of Scaling Exponents
In this work, one property that we focus on is how methods scale as we increase parameters or
training data. This builds on a recent body of work on scaling laws in deep learning [8,9,10,11,
12,13,14,15,16] which has found that, on many tasks, loss decreases predictably as a power-law
in the number of model parameters and amount of training data. Attempting to understand this
scaling behavior, [17,18] argue that in some regimes, cross-entropy and MSE loss should scale as
$N^{-\alpha}$, where $\alpha \gtrsim 4/d$, $N$ is the number of model parameters, and $d$ is the intrinsic dimensionality
of the data manifold of the task.
Consider the problem of approximating some analytic function $f: [0,1]^d \to \mathbb{R}$ with a function
which is a piecewise degree-$n$ polynomial. If one partitions a hypercube in $\mathbb{R}^d$ into regions of side length $\epsilon$
and approximates $f$ as a degree-$n$ polynomial in each region (requiring $N = O(1/\epsilon^d)$ parameters), the
absolute error in each region will be $O(\epsilon^{n+1})$ (given by the degree-$(n+1)$ term in the Taylor expansion
of $f$), and so the absolute error scales as $N^{-\frac{n+1}{d}}$. If neural networks use ReLU activations, they are
piecewise linear, so $n = 1$ and we may expect $\ell_{\rm rms}(N) \propto N^{-\frac{2}{d}}$. However, in line with [17], we
find that ReLU NNs often scale as if the problem were lower-dimensional than the input dimension,
though we suggest that this is a result of the computational modularity of the problems in our
setting, rather than a matter of low intrinsic dimensionality (though these perspectives are related).
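As a sanity check of this scaling argument (our illustration, not an experiment from the paper), the following numpy sketch measures how the relative RMSE of uniform piecewise-linear interpolation decays with the number of knots $N$ for a smooth 1D target, where the argument above predicts $\ell_{\rm rms} \propto N^{-2}$ (here $n = 1$, $d = 1$); the target function and grid sizes are arbitrary.

```python
import numpy as np

# Piecewise-linear interpolation of a 1D analytic function on a uniform grid.
f = lambda x: np.exp(np.sin(2 * np.pi * x))        # an arbitrary smooth target
x_test = np.linspace(0.0, 1.0, 100_001)            # dense evaluation grid
y_test = f(x_test)

Ns = [16, 32, 64, 128, 256, 512, 1024]
losses = []
for N in Ns:
    knots = np.linspace(0.0, 1.0, N)               # N parameters: function values at the knots
    y_pred = np.interp(x_test, knots, f(knots))    # linear interpolation between knots
    losses.append(np.sqrt(np.mean((y_pred - y_test) ** 2) / np.mean(y_test ** 2)))

# Fit the empirical exponent alpha in l_rms ~ N^(-alpha); expect alpha close to 2.
alpha = -np.polyfit(np.log(Ns), np.log(losses), deg=1)[0]
print(f"fitted scaling exponent: {alpha:.2f}   (prediction: (n+1)/d = 2)")
```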
If one desires very low loss, then the exponent $\alpha$, the rate at which methods approach their
best possible performance,¹ matters a great deal. Kaplan et al. [17] note that $4/d$ is merely a lower
bound on the scaling rate; we consider ways that neural networks can improve on this bound.
Understanding model scaling is key to understanding the feasibility of achieving high precision.
1.4 Organization
This paper is organized as follows: In Section 2 we discuss piecewise linear approximation methods,
comparing ReLU networks with linear simplex interpolation. We find that neural networks can
sometimes outperform simplex interpolation, and suggest that they do this by discovering modular
structure in the data. In Section 3 we discuss nonlinear methods, including neural networks with
nonlinear activation functions. In Section 4 we discuss the optimization challenges of high-precision
neural network training, and how optimization difficulties can often make the total error far worse than the
limits of what architecture error allows. We attempt to develop optimization methods for overcoming
these problems and describe their limitations, then conclude in Section 5.
¹The best possible performance can be determined either by precision limits or by noise intrinsic to the problem,
such as intrinsic entropy of natural language.