
that every Karush–Kuhn–Tucker (KKT) point is an approximate global minimizer of the
unconstrained training problem.⁴
• In real-data experiments, our proposed training regime can significantly outperform SGD
for training narrow networks. We also perform ablation studies to show that the new elements
proposed in our method are useful.
2 Background and Related Works
The expressivity of neural networks has been a popular topic in machine learning for decades. There
are two lines of work: one focuses on the infinite-sample expressivity, showing which functions over
the entire domain can and cannot be represented by certain classes of neural networks (e.g. [4, 45]).
Another line of work characterizes the finite-sample expressivity, i.e. how many parameters are
required to memorize finitely many samples (e.g. [5, 13, 21, 25, 26, 42, 75]). The term “expressivity” in this
work means the finite-sample expressivity; see Definition 1 in Section 4.1. A major research question
regarding expressivity in this area is to show that deep neural networks have much stronger expressivity
than shallow ones (e.g. [7, 17, 41, 48, 56, 57, 67, 70, 72]). However, all these works neglect
trainability.
In the finite-sample case, wide networks (width $\mathrm{poly}(n)$) have both strong representation power
(i.e. the globally minimal training error is zero) and strong trainability (e.g. for wide enough nets,
Gradient Descent (GD) converges to global minima [1, 16, 28, 80]). While these wide networks are
often called “over-parameterized”, we note that the number of parameters of a width-$n$ network
is actually at least $nd$, which is much larger than $n$. If we compare $n$ with the number of parameters
(instead of the number of neurons), the transition from under-parameterization to over-parameterization for a
one-hidden-layer fully-connected net (FCN) does not occur at width-$n$, but at width-$n/d$. In this
work, we analyze networks with width in the range $[n/d, n)$, which we call “narrow networks”
(though, rigorously speaking, we should call them “narrow but still over-parameterized networks”).
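As a back-of-the-envelope check on this transition point (for a scalar-output network and ignoring bias terms, which only contribute lower-order terms), a one-hidden-layer FCN of width $m$ with input dimension $d$ has
\[
\#\text{params} \;=\; \underbrace{md}_{\text{hidden layer}} \;+\; \underbrace{m}_{\text{output layer}} \;=\; m(d+1),
\qquad\text{so}\qquad
m(d+1)\ \ge\ n \;\Longleftrightarrow\; m \ \ge\ \frac{n}{d+1} \,\approx\, \frac{n}{d}.
\]
Counting neurons instead of parameters would place the same threshold at width $n$, which is why the regime $[n/d, n)$ is narrow by the neuron count yet still over-parameterized by the parameter count.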
There are a few works on the trainability of narrow nets (one-hidden-layer networks with $m \ge n/d$
neurons). Soudry and Carmon [63] and Xie et al. [71] show that for such networks, stationary points
with full-rank NTK (neural tangent kernel) are zero-loss global minima. However, it is not clear
whether the NTK stays full-rank along the training trajectory. In addition, these two works do not
discuss whether a zero-loss global minimizer exists.
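For intuition, here is a sketch of the standard argument behind such results, in our own notation and suppressing the precise assumptions of [63, 71]: write the training loss as $L(\theta) = \frac{1}{2}\|r(\theta)\|^2$ for the residual vector $r(\theta) \in \mathbb{R}^n$, and let $J(\theta) \in \mathbb{R}^{n \times p}$ be the Jacobian of the network outputs with respect to the $p$ parameters, so that the NTK Gram matrix is $J(\theta)J(\theta)^\top$. Then
\[
\nabla L(\theta) \;=\; J(\theta)^\top r(\theta) \;=\; 0
\quad\text{and}\quad
\operatorname{rank}\!\big(J(\theta)J(\theta)^\top\big) \;=\; n
\;\;\Longrightarrow\;\;
r(\theta) \;=\; 0,
\]
since a full-rank Gram matrix means $J(\theta)$ has rank $n$, so $J(\theta)^\top$ is injective and $J(\theta)^\top r(\theta)=0$ forces $r(\theta)=0$. This is exactly why the remaining question is whether the NTK stays full-rank along the trajectory.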
There are two interesting related works [9, 12] pointed out by the reviewers. Bubeck et al. [9] study
how many neurons are required for memorizing a finite dataset with a one-hidden-layer network. They
prove the following results. Their first result is an “existence” result: there exists a network with
width $m \ge 4n/d$ that can memorize $n$ input-label pairs (their Proposition 4). However, in this setting
they did not provide an algorithm to find the zero-loss solution. Their second result is related to
algorithms: they proposed a training algorithm that achieves accuracy up to error $\epsilon$ for a neural net
with width $m \ge O\!\left(\frac{n}{d\epsilon}\log\frac{1}{\epsilon}\right)$. This result requires a width that depends on the precision $\epsilon$; for instance,
when the desired accuracy is $\epsilon = 1/n$, the required width is at least $O(n^2/d)$. In contrast, in our work,
the required number of neurons is just $2n/d$, which is independent of $\epsilon$.
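For concreteness, and assuming the bound takes the form stated above, substituting $\epsilon = 1/n$ gives
\[
O\!\left(\frac{n}{d\epsilon}\log\frac{1}{\epsilon}\right)\bigg|_{\epsilon = 1/n}
\;=\; O\!\left(\frac{n^2}{d}\log n\right)\ \ge\ O\!\left(\frac{n^2}{d}\right),
\]
whereas the width $2n/d$ used in this work does not grow as the target accuracy shrinks.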
Daniely [12] also studies the expressivity and trainability of one-hidden-layer networks. To memorize
$n(1-\epsilon)$ random data points via SGD, their required width is $\tilde{O}(n/d)$. They assumed $n = d^c$ where
$c > 0$ is a fixed constant (as stated in Sec. 3.3 of [12]), in which case the hidden factor in $\tilde{O}$ is
$O\big((\log(d\log d))^{c}\big)$. In other words, if $n, d \to \infty$ with the scaling $n = d^c$ for a fixed constant $c$, then
their bound is roughly $O(n/d)$ up to a log-factor. Nevertheless, for more general scalings of $n$ and $d$, the
exponent $c = (\log n)/(\log d)$ may not be a constant and the hidden factor may not be a log-factor
(e.g. for fixed $d$ and $n \to \infty$). We tracked their proof and found that the width bound for general
$n, d$ is $O(n/d)\,(\log(d\log n))^{\log n/\log d}$, which can be larger than $O(n^2/d)$ (see the detailed computation
and explanation in Appendix C). In contrast, our required width is $2n/d$ for arbitrary $n$ and $d$. Our
bound is always smaller than $n$ when $d > 2$.
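As a rough consistency check between the two expressions above (with constants absorbed into $O(\cdot)$), substituting $n = d^c$, i.e. $\log n/\log d = c$, into the general bound gives
\[
O\!\left(\frac{n}{d}\right)\big(\log(d\log n)\big)^{\log n/\log d}
\;=\; O\!\left(\frac{n}{d}\right)\big(\log(c\,d\log d)\big)^{c}
\;=\; O\!\left(\frac{n}{d}\right)\cdot O\!\big((\log(d\log d))^{c}\big),
\]
matching the $\tilde{O}(n/d)$ statement with the hidden factor above. For fixed $d$ and $n \to \infty$, however, the exponent $\log n/\log d$ grows with $n$, so the extra factor is no longer a log-factor.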
⁴ This result describes the loss landscape of the constrained optimization problem and is not directly related to
algorithm convergence. Nevertheless, it is expected that first-order methods converge to KKT points and
thus to approximate global minimizers. A rigorous convergence analysis may require verifying extra technical
conditions, which is left to future work.