
where all terms are positive except possibly the sampling luck, which is zero on average, has a
standard deviation shrinking with data size $|D|$ according to the Poisson scaling $|D|^{-1/2}$, and will
be ignored in the present paper. The generalization gap has been extensively studied in prior work,
so this paper will focus exclusively on the optimization error and the architecture error.
To summarize: the architecture error is the best possible performance that a given architecture
can achieve on the task; the generalization gap is the difference between the optimal performance
on the training set $D$ and the architecture error; and the optimization error is the error introduced
by imperfect optimization, that is, the difference between the training-set error found by imperfect
optimization and the optimal error on the training set. When comparing methods and studying their
scaling, it is useful to ask which of these error sources dominates. We will see that both architecture
error and optimization error can be quite important in the high-precision regime, as we elaborate
in Sections 2-3 and Section 4, respectively.
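Schematically, and using illustrative symbols rather than the paper's own notation, the decomposition described above can be restated as

$$\ell_{\rm total} \;=\; \underbrace{\ell_{\rm arch}}_{\text{architecture error}} \;+\; \underbrace{\ell_{\rm gen}}_{\text{generalization gap}} \;+\; \underbrace{\ell_{\rm opt}}_{\text{optimization error}} \;+\; \underbrace{\ell_{\rm luck}}_{\text{sampling luck}},$$

where all terms except the sampling luck are non-negative, and the sampling luck vanishes in expectation with standard deviation $O(|D|^{-1/2})$.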
1.3 Importance of Scaling Exponents
In this work, one property that we focus on is how methods scale as we increase parameters or
training data. This builds on a recent body of work on scaling laws in deep learning [8, 9, 10, 11,
12, 13, 14, 15, 16], which has found that, on many tasks, loss decreases predictably as a power law
in the number of model parameters and the amount of training data. Attempting to understand this
scaling behavior, [17, 18] argue that in some regimes, cross-entropy and MSE loss should scale as
$N^{-\alpha}$, where $\alpha \gtrsim 4/d$, $N$ is the number of model parameters, and $d$ is the intrinsic dimensionality
of the data manifold of the task.
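As a concrete illustration (ours, not an experiment from this paper), a scaling exponent of this kind is typically estimated by fitting a line to empirical $(N, \mathrm{loss})$ pairs in log-log space; the measurements below are synthetic, generated with a known exponent purely to show the procedure:

import numpy as np

# Synthetic (N, loss) measurements; in practice these would come from training
# runs at several model sizes. Here loss ∝ N^(-0.7), with small multiplicative noise.
rng = np.random.default_rng(0)
N = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
loss = 5.0 * N ** (-0.7) * np.exp(0.05 * rng.standard_normal(len(N)))

# A power law loss = a * N^(-alpha) is linear in log-log space:
# log(loss) = log(a) - alpha * log(N), so the fitted slope gives -alpha.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(f"estimated scaling exponent alpha ≈ {-slope:.2f}")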
Consider the problem of approximating some analytic function $f: [0,1]^d \to \mathbb{R}$ with a function
which is a piecewise $n$-degree polynomial. If one partitions a hypercube in $\mathbb{R}^d$ into regions of side length $\epsilon$
and approximates $f$ as an $n$-degree polynomial in each region (requiring $N = O(1/\epsilon^d)$ parameters),
the absolute error in each region will be $O(\epsilon^{n+1})$ (given by the degree-$(n+1)$ term in the Taylor expansion
of $f$), and so absolute error scales as $N^{-\frac{n+1}{d}}$. If neural networks use ReLU activations, they are
piecewise linear, so $n = 1$ and we may expect $\ell_{\rm rmse}(N) \propto N^{-2/d}$. However, in line with [17], we
find that ReLU NNs often scale as if the problem were lower-dimensional than the input dimension,
though we suggest that this is a result of the computational modularity of the problems in our
setting, rather than a matter of low intrinsic dimensionality (though these perspectives are related).
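As a quick numerical sanity check of the Taylor-expansion argument above (our own sketch, not an experiment from the paper), piecewise-linear interpolation of a smooth one-dimensional function should have RMSE falling as $N^{-2}$, i.e. roughly a factor of four per doubling of $N$:

import numpy as np

# Piecewise-linear interpolation of an analytic function in d = 1.
# The argument above predicts rmse ∝ N^{-(n+1)/d} = N^{-2} for n = 1, d = 1.
f = lambda x: np.sin(2 * np.pi * x)
x_test = np.linspace(0.0, 1.0, 100_000)

for N in [8, 16, 32, 64, 128]:
    knots = np.linspace(0.0, 1.0, N)            # N "parameters": function values at the knots
    y_hat = np.interp(x_test, knots, f(knots))  # piecewise-linear fit, analogous to a ReLU network
    rmse = np.sqrt(np.mean((y_hat - f(x_test)) ** 2))
    print(f"N = {N:4d}   rmse = {rmse:.3e}")    # should drop ~4x per doubling of N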
If one desires very low loss, then the exponent $\alpha$, the rate at which methods approach their
best possible performance¹, matters a great deal. Kaplan et al. [17] note that $4/d$ is merely a lower
bound on the scaling rate; we consider ways that neural networks can improve on this bound.
Understanding model scaling is key to understanding the feasibility of achieving high precision.
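A back-of-the-envelope consequence of the power-law form above (our illustration): if $\ell \propto N^{-\alpha}$, then reaching a target loss $\ell_{\rm target}$ requires

$$N \propto \ell_{\rm target}^{-1/\alpha},$$

so lowering the target loss by a factor of $100$ multiplies the required parameter count by $100^{1/\alpha}$: a factor of $10$ when $\alpha = 2$, but $10^{4}$ when $\alpha = 1/2$.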
1.4 Organization
This paper is organized as follows. In Section 2 we discuss piecewise linear approximation methods,
comparing ReLU networks with linear simplex interpolation. We find that neural networks can
sometimes outperform simplex interpolation, and suggest that they do this by discovering modular
structure in the data. In Section 3 we discuss nonlinear methods, including neural networks with
nonlinear activation functions. In Section 4 we discuss the optimization challenges of high-precision
neural network training: how optimization difficulties can often make total error far worse than the
limit set by the architecture error. We attempt to develop optimization methods for overcoming
these problems and describe their limitations, then conclude in Section 5.
¹The best possible performance can be determined either by precision limits or by noise intrinsic to the problem,
such as the intrinsic entropy of natural language.