When Expressivity Meets Trainability: Fewer than n
Neurons Can Work
Jiawei Zhang
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong,
Shenzhen, China
jiaweizhang2@link.cuhk.edu.cn
Yushun Zhang
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong,
Shenzhen, China
yushunzhang@link.cuhk.edu.cn
Mingyi Hong
University of Minnesota - Twin Cities
MN, USA
mhong@umn.edu
Ruoyu Sun
University of Illinois at Urbana-Champaign
IL, USA
ruoyus@illinois.edu
Zhi-Quan Luo
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong,
Shenzhen, China
luozq@cuhk.edu.cn
Abstract
Modern neural networks are often quite wide, causing large memory and com-
putation costs. It is thus of great interest to train a narrower network. However,
training narrow neural nets remains a challenging task. We ask two theoretical
questions: Can narrow networks have as strong expressivity as wide ones? If so,
does the loss function exhibit a benign optimization landscape? In this work, we
provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than $n$ (sample size) neurons when the activation is smooth. First, we prove that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its
expressivity is strong, i.e., there exists at least one global minimizer with zero
training loss. Second, we identify a nice local region with no local-min or saddle
points. Nevertheless, it is not clear whether gradient descent can stay in this nice re-
gion. Third, we consider a constrained optimization formulation where the feasible
region is the nice local region, and prove that every KKT point is a nearly global
minimizer. It is expected that projected gradient methods converge to KKT points
under mild technical conditions, but we leave the rigorous convergence analysis
to future work. Thorough numerical results show that projected gradient methods
on this constrained formulation significantly outperform SGD for training narrow
neural nets.
1 Introduction
Modern neural networks are huge (e.g. [8, 74]). Reducing the size of neural nets is appealing for
many reasons: first, small networks are more suitable for embedded systems and portable devices;
second, using smaller networks can reduce power consumption, contributing to “green computing”.
Equal contribution. These authors are listed in alphabetical order.
Corresponding author: Ruoyu Sun.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
arXiv:2210.12001v1 [cs.LG] 21 Oct 2022
There are many ways to reduce network size, such as quantization, sparsification, and reducing the width (e.g. [20, 77]). In this work, we focus on reducing the width (training narrow nets).
Reducing the network width often leads to significantly worse performance³. What is the possible
. What is the possible
cause? From the theoretical perspective, there are three possible causes: worse generalization power,
worse trainability (how effective a network can be optimized), and weaker expressivity (how complex
the function a network can represent; see Definition 1). Our simulation shows that the training error
deteriorates significantly as the width shrinks, which implies that the trainability and/or expressivity
are important causes of the worse performance (see Section 5.1 for more evidence). We do not
discuss generalization power for now, and leave it to future work.
So how is the training error related to expressivity and trainability? The training error is the sum of
two parts (see, e.g., [64]): the expressive error (which is the best a given network can do; also the
global minimal error) and the optimization error (which is the gap between training error and the
global minimal error; occurs because the algorithm may not find global-min). The two errors are of
different nature, and thus need to be discussed separately.
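This decomposition can be written out explicitly (a standard identity, stated here for concreteness; $\hat{\theta}$ denotes the parameters returned by the training algorithm and $\ell^{\ast}$ the global minimum of the empirical loss):
$$\underbrace{\ell(\hat{\theta})}_{\text{training error}} \;=\; \underbrace{\ell^{\ast}}_{\text{expressive error}} \;+\; \underbrace{\ell(\hat{\theta}) - \ell^{\ast}}_{\text{optimization error}}, \qquad \ell^{\ast} := \min_{\theta}\, \ell(\theta) \ge 0.$$
In particular, a small training error requires both a small $\ell^{\ast}$ (strong expressivity) and a small gap $\ell(\hat{\theta}) - \ell^{\ast}$ (good trainability).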
It is understandable that narrower networks might have weaker expressive power. What about
optimization? There is also evidence that smaller width causes optimization difficulty. A number of
recent works show that increasing the width of neural networks helps create a benign empirical loss
landscape ([19, 35, 59]), while narrow networks (width $m <$ sample size $n$) suffer from a bad landscape ([2, 60, 66, 72, 79]). Therefore, if we want to improve the performance of narrow networks, it is
likely that both expressiveness and trainability need to be improved.
The above discussion leads to the following two questions:
(Q1) Can a narrow network have as strong expressivity as a wide one?
(Q2) If so, can a local search method find a (near) globally optimal solution?
The key challenges in answering these questions are listed below:
- It is not clear whether a narrow network has strong expressivity or not. Many existing works focus on verifying the relationship between zero-training-error solutions and stationary points, but they neglect the (non)existence of such solutions (e.g. [71], [63]). For narrow networks, the (non)existence of zero-training-error solutions is not clear.
- Even if zero-training-error solutions do exist, it is still not clear how to reach them, because the landscape of a narrow neural network can be highly non-convex.
- Even assuming that we can identify a region that contains zero-training-error solutions and has a good landscape, it is potentially difficult to keep the iterates inside such a good region. One may think of imposing an explicit constraint, but this approach might introduce bad local minimizers on the boundary [6].
In this work, we (partially) answer (Q1) and (Q2) for 1-hidden-layer nets with fewer than $n$ neurons.
Our main contributions are as follows:
- Expressiveness and nice local landscape. We prove that, as long as the width $m$ is larger than $2n/d$ (where $n$ is the sample size and $d$ is the input dimension), the expressivity of the 1-hidden-layer net is strong, i.e., w.p.1 there exists at least one global-min with zero empirical loss. In addition, such a solution is surrounded by a good local landscape with no local-min or saddles. Note that our results do not exclude the possibility that there are sub-optimal local minimizers on the global landscape.
- Every KKT point is an approximate global minimizer. For the original unconstrained optimization problem, the nice local landscape does not guarantee the global statement that "every stationary point is a global minimizer". We propose a constrained optimization problem that restricts the hidden weights to be close to the identified nice region. We show that every Karush–Kuhn–Tucker (KKT) point is an approximate global minimizer of the unconstrained training problem⁴.
- In real-data experiments, our proposed training regime significantly outperforms SGD for training narrow networks. We also perform ablation studies to show that the new elements proposed in our method are useful.

³ This can be verified in our empirical studies in Section 5. Another piece of evidence is that structured pruning (reducing the number of channels in convolutional neural nets (CNNs)) is known to achieve worse performance than unstructured pruning; this is an undesirable situation since many practitioners prefer structured pruning (for hardware reasons).
2 Background and Related Works
The expressivity of neural networks has been a popular topic in machine learning for decades. There
are two lines of works: One focuses on the infinite-sample expressivity, showing what functions of
the entire domain can and cannot be represented by certain classes of neural networks (e.g. [4, 45]). Another line of works characterizes the finite-sample expressivity, i.e., how many parameters are required to memorize finite samples (e.g. [5, 13, 21, 25, 26, 42, 75]). The term "expressivity" in this work means the finite-sample expressivity; see Definition 1 in Section 4.1. A major research question regarding expressivity in the area is to show that deep neural networks have much stronger expressivity than shallow ones (e.g. [7, 17, 41, 48, 56, 57, 67, 70, 72]). However, all these works neglect the trainability.
In the finite-sample case, wide networks (width $\geq \mathrm{poly}(n)$) have both strong representation power (i.e., the globally minimal training error is zero) and strong trainability (e.g., for wide enough nets, Gradient Descent (GD) converges to global minima [1, 16, 28, 80]). While these wide networks are often called "over-parameterized", we notice that the number of parameters of a width-$n$ network is actually at least $nd$, which is much larger than $n$. If we compare $n$ with the number of parameters (instead of neurons), the transition from under-parameterization to over-parameterization for a one-hidden-layer fully-connected net (FCN) does not occur at width $n$, but at width $n/d$. In this work, we will analyze networks with width in the range $[n/d, n)$, which we call "narrow networks" (though rigorously speaking, we should call them "narrow but still over-parameterized networks").
There are a few works on the trainability of narrow nets (one-hidden-layer networks with $m \geq n/d$ neurons). Soudry and Carmon [63], Xie et al. [71] show that for such networks, stationary points
with full-rank NTK (neural tangent kernel) are zero-loss global minima. However, it is not clear
whether the NTK stays full rank during the training trajectory. In addition, these two works do not
discuss whether a zero-loss global minimizer exists.
There are two interesting related works [9, 12] pointed out by the reviewers. Bubeck et al. [9] study how many neurons are required for memorizing a finite dataset by 1-hidden-layer networks. They prove the following results. Their first result is an "existence" result: there exists a network with width $m \geq 4n/d$ which can memorize $n$ input-label pairs (their Proposition 4). However, in this setting they did not provide an algorithm to find the zero-loss solution. Their second result is related to algorithms: they proposed a training algorithm that achieves accuracy up to error $\epsilon$ for a neural net with width $m \geq O\!\left(\frac{n}{d\epsilon}\log(1/\epsilon)\right)$. This result requires a width that depends on the precision $\epsilon$; for instance, when the desired accuracy is $\epsilon = 1/n$, the required width is at least $O\!\left(n^{2}/d\right)$. In contrast, in our work, the required number of neurons is just $2n/d$, which is independent of $\epsilon$.
Daniely [12] also studies the expressivity and trainability of 1-hidden-layer networks. To memorize $n(1-\epsilon)$ random data points via SGD, their required width is $\tilde{O}(n/d)$. They assumed $n = d^{c}$ where $c > 0$ is a fixed constant (this appears in Sec. 3.3 of [12]), in which case the hidden factor in $\tilde{O}$ is $O\!\left((\log[d(\log d)])^{c}\right)$. In other words, if $n, d \to \infty$ with the scaling $n = d^{c}$ for a fixed constant $c$, then their bound is roughly $O(n/d)$ up to a log-factor. Nevertheless, for more general scalings of $n, d$, the exponent $c = (\log n)/(\log d)$ may not be a constant and the hidden factor may not be a log-factor (e.g. for fixed $d$ and $n \to \infty$). We tracked their proof and found that the width bound for general $n, d$ is $O(n/d)\,\big(\log(d\log n)\big)^{\log n/\log d}$, which can be larger than $O\!\left(n^{2}/d\right)$ (see the detailed computation and explanation in Appendix C). In contrast, our required width is $2n/d$ for arbitrary $n$ and $d$. Our bound is always smaller than $n$ when $d > 2$.
⁴ This result describes the loss landscape of the constrained optimization problem and is not directly related to algorithm convergence. Nevertheless, it is expected that first-order methods converge to KKT points and thus to approximate global minimizers. A rigorous convergence analysis may require verifying extra technical conditions, which is left to future work.
Additionally, there is a major difference between Daniely [12] and our work: they analyze the original
unconstrained problem and SGD; in contrast, we analyze a constrained problem. This may be the
reason why we can get a stronger bound on width. In the experiments in Section 5, we observe that
SGD performs badly when the width is small (see the 1st column in Figure 4 (b)). Therefore, we
suspect an algorithmic change is needed to train narrow nets with such width (due to the training
difficulty), and we indeed propose a new method to train narrow nets.
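As a rough illustration of the kind of algorithmic change we have in mind, the sketch below runs projected gradient descent on a 1-hidden-layer net, projecting the hidden weights back onto a Euclidean ball around their initialization after every step. This is only a schematic sketch: the ball-shaped constraint, the `tanh` activation, and all hyperparameters are illustrative assumptions; the actual feasible region and training procedure are defined in Section 4.3.

```python
import numpy as np

def projected_gd(x, y, m, R=1.0, lr=1e-2, steps=2000, seed=0):
    """Schematic projected GD for f(x) = sum_j v_j * tanh(w_j^T x).

    The hidden weights are projected onto {w : ||w - w0||_F <= R} after
    every step; the radius R and constraint shape are illustrative only.
    """
    n, d = x.shape
    rng = np.random.default_rng(seed)
    w0 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, m))   # LeCun-style scale
    w = w0.copy()
    v = rng.normal(scale=1.0 / np.sqrt(m), size=m)

    for _ in range(steps):
        h = np.tanh(x @ w)                                  # (n, m) hidden outputs
        r = h @ v - y                                       # (n,) residuals f - y
        grad_v = h.T @ r                                    # gradient w.r.t. outer weights
        grad_w = x.T @ ((r[:, None] * (1.0 - h ** 2)) * v[None, :])  # (d, m)
        v -= lr * grad_v
        w -= lr * grad_w
        # projection: pull the hidden weights back toward the initialization
        diff = w - w0
        nrm = np.linalg.norm(diff)
        if nrm > R:
            w = w0 + diff * (R / nrm)

    final_loss = 0.5 * np.sum((np.tanh(x @ w) @ v - y) ** 2)
    return w, v, final_loss
```

For example, with $n$ samples in dimension $d$, one could call `projected_gd(x, y, m=2 * n // d + 1)` to train a width-$\lceil 2n/d \rceil$ net under this constraint; the only point of the sketch is the projection step, which keeps the iterates inside a prescribed neighborhood of the initialization.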
Due to space constraints, we defer more related works to Appendix B.
3 Challenges For Analyzing Narrow Nets
In this section, we discuss why it is challenging to achieve expressivity and trainability together for
narrow nets. Consider a dataset $\{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d} \times \mathbb{R}$ and a 1-hidden-layer neural network:
$$f(x;\theta) = \sum_{j=1}^{m} v_j\, \sigma\!\left(w_j^{T} x\right), \qquad (1)$$
where $\sigma\!\left(w_j^{T} x\right)$ is the output of the $j$-th hidden node with hidden weights $w_j$, $\sigma(\cdot)$ is the activation function, and $v_j$ is the corresponding outer weight (bias terms are ignored for simplicity). To learn such a neural network, we search for the optimal parameter $\theta = (w, v)$ by minimizing the following empirical (training) loss:
$$\min_{\theta}\ \ell(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - f(x_i;\theta)\right)^{2}. \qquad (2)$$
The gradient of the above problem w.r.t. the hidden weights $w = \{w_i\}_{i=1}^{m}$ is given by
$$\nabla_w \ell(\theta) = J(w;v)^{T}\big(f(w;v) - y\big) \in \mathbb{R}^{md\times 1},$$
where $J(w;v) \in \mathbb{R}^{n\times md}$ is the Jacobian matrix w.r.t. $w$:
$$J(w;v) := \begin{pmatrix} \nabla_w f(w; x_1, v)^{T} \\ \vdots \\ \nabla_w f(w; x_n, v)^{T} \end{pmatrix} = \begin{pmatrix} v_1 \sigma'\!\left(w_1^{T} x_1\right) x_1^{T} & \cdots & v_m \sigma'\!\left(w_m^{T} x_1\right) x_1^{T} \\ \vdots & & \vdots \\ v_1 \sigma'\!\left(w_1^{T} x_n\right) x_n^{T} & \cdots & v_m \sigma'\!\left(w_m^{T} x_n\right) x_n^{T} \end{pmatrix} \in \mathbb{R}^{n\times md}. \qquad (3)$$
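As a sanity check on the expression in (3), the following NumPy sketch assembles $J(w;v)$ block by block for a smooth activation and compares one of its columns against a finite-difference approximation; the `tanh` activation and the random data are illustrative choices only.

```python
import numpy as np

def jacobian(w, v, x):
    """Jacobian of f(w; v) w.r.t. the hidden weights, as in Eq. (3).

    Row i is (v_1 sigma'(w_1^T x_i) x_i^T, ..., v_m sigma'(w_m^T x_i) x_i^T).
    Shapes: w is (d, m), v is (m,), x is (n, d); the output is (n, m*d).
    """
    n, d = x.shape
    m = v.shape[0]
    s = 1.0 - np.tanh(x @ w) ** 2                           # sigma'(w_j^T x_i), shape (n, m)
    blocks = (s * v[None, :])[:, :, None] * x[:, None, :]   # (n, m, d)
    return blocks.reshape(n, m * d)

rng = np.random.default_rng(0)
n, d, m = 6, 3, 4                                           # md = 12 >= n, so J can be full row rank
x = rng.normal(size=(n, d))
w = rng.normal(size=(d, m))
v = rng.normal(size=m)

J = jacobian(w, v, x)
f = lambda w_: np.tanh(x @ w_) @ v                          # network outputs f(w; v)

eps = 1e-6
w_pert = w.copy()
w_pert[0, 0] += eps                                         # perturb the first coordinate of w_1
print(np.allclose(J[:, 0], (f(w_pert) - f(w)) / eps, atol=1e-4))   # expect True
print("rank of J:", np.linalg.matrix_rank(J), "with n =", n)
```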
First-order methods like GD converge to a stationary point $\theta = (w, v)$ (i.e., a point with zero gradient) under mild conditions [6]. For problem (2), it is easy to show that if (i) $(w, v)$ is a stationary point and (ii) $J(w;v) \in \mathbb{R}^{n\times md}$ is full row rank with $n \leq md$, then $(w, v)$ is a global-min (this claim can be proved by setting the partial gradient of (2) over $w$ to zero). In other words, for training a network with width $m \geq n/d$, an important tool is to ensure the full-rankness of the Jacobian.
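For completeness, the claim above follows from a one-line calculation: at a stationary point the partial gradient over $w$ vanishes,
$$0 = \nabla_w \ell(\theta) = J(w;v)^{T}\big(f(w;v) - y\big),$$
and if $J(w;v)$ has full row rank (so $J(w;v)^{T}$ has trivial null space, which requires $n \leq md$), this forces $f(w;v) = y$, i.e., the training loss (2) is zero at $(w, v)$.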
Recent works have shown that it is possible to guarantee the full-rankness of the Jacobian matrix along the training trajectories; however, the required width is above $\Omega(\mathrm{poly}(n))$. Roughly speaking, the proof sketch is the following: (i) with high probability, $J(w;v)$ is non-singular locally around the random initialization ([71], [63]); (ii) increasing the width can effectively bound the parameter movement from initialization, so the "nice" property of a non-singular $J(w;v)$ holds throughout training, leading to a linear convergence rate [16]. Under this general framework, a number of convergence results have been developed for wide networks with width $\Omega(\mathrm{poly}(n))$ ([1, 11, 28, 34, 49, 53–55, 80, 81]). This idea is also illustrated in Figure 1(a).
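The role of width in step (ii) is easy to probe empirically. The sketch below trains 1-hidden-layer nets of increasing width with plain gradient descent on the same random data and records the relative movement of the hidden weights from their initialization; all choices here (the NTK-style $1/\sqrt{m}$ output scaling, `tanh` activation, learning rate, step count, random data) are illustrative assumptions, not the experimental settings of Section 5.

```python
import numpy as np

def relative_movement(m, n=64, d=8, lr=1e-2, steps=2000, seed=0):
    """Train f(x) = (1/sqrt(m)) * sum_j v_j tanh(w_j^T x) with plain GD and
    return ||w - w0||_F / ||w0||_F for the hidden weights. The 1/sqrt(m)
    scaling keeps the step size comparable across widths."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w0 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, m))
    w = w0.copy()
    v = rng.normal(size=m)
    c = 1.0 / np.sqrt(m)
    for _ in range(steps):
        h = np.tanh(x @ w)                                   # (n, m)
        r = c * (h @ v) - y                                  # residuals f - y
        grad_v = c * (h.T @ r)
        grad_w = c * (x.T @ ((r[:, None] * (1.0 - h ** 2)) * v[None, :]))
        v -= lr * grad_v
        w -= lr * grad_w
    return np.linalg.norm(w - w0) / np.linalg.norm(w0)

# wider nets are expected to move relatively less from their initialization
for m in (16, 64, 256, 1024):
    print(f"width {m:5d}: relative movement {relative_movement(m):.4f}")
```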
We notice that there is a huge gap between the necessary condition $m \geq n/d$ and the common condition $m \geq \Omega(\mathrm{poly}(n))$. We suspect that it is possible to train a narrow net with width $\Theta(n/d)$ to small loss. To achieve this goal, we need to understand why existing arguments require a large width and cannot apply to a network with width $\Theta(n/d)$.
The first reason is about trainability. The above arguments no longer hold when the width is not large
enough to control the movement of hidden weights. In this case, the iterates may easily travel far
away from the initial point and get stuck at some singular-Jacobian critical points with high training loss (see Figure 1(b); also see Figure 4(b) & (d) for more empirical evidence). In other words, GD may get stuck at sub-optimal stationary points for narrow nets.
The second reason, and also an easily ignored one, is the expressivity (a.k.a. the representation power,
see Definition 1 for a formal statement). In above discussion, we implicitly assumed that there exists
a zero-loss global minimizer, which is equivalent to “there exists a network configuration such that
the network can memorize the data". For networks with width at least $n$, this assumption can be justified in the following way. The feature matrix
$$\Phi(w) := \begin{pmatrix} \sigma\!\left(w_1^{T} x_1\right) & \cdots & \sigma\!\left(w_m^{T} x_1\right) \\ \vdots & & \vdots \\ \sigma\!\left(w_1^{T} x_n\right) & \cdots & \sigma\!\left(w_m^{T} x_n\right) \end{pmatrix} \in \mathbb{R}^{n\times m} \qquad (4)$$
can span the whole space $\mathbb{R}^{n}$ when it is full rank and $m \geq n$; thus the network can perfectly fit any label $y \in \mathbb{R}^{n}$ even without training the hidden layer. It is important to note that when the width $m$ is below the sample size $n$, full-rankness does not ensure that the column space of the feature matrix is the whole space $\mathbb{R}^{n}$. In other words, it is not clear whether a global-min with zero loss exists.

[Figure 1: The parameter movement under different regimes: (a) wide networks, (b) narrow networks, (c) narrow networks (our regime). The shaded area indicates the region where $J(w;v)$ is non-singular, the black circle denotes the region that GD iterates will explore, and the red circle is the constraint designed in our training regime; it will be discussed in Section 4.3.]
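To illustrate the feature-matrix argument above with a minimal NumPy sketch (the `tanh` activation, random data, and random hidden weights are illustrative assumptions): with $m \geq n$ one can fit arbitrary labels by solving for the outer weights $v$ alone, whereas with $m < n$ the same least-squares fit generically leaves a nonzero residual.

```python
import numpy as np

def residual_after_fitting_v(x, y, m, seed=0):
    """Freeze random hidden weights, build Phi(w) as in Eq. (4), and fit the
    outer weights by least squares: min_v ||Phi(w) v - y||_2."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(d, m))
    phi = np.tanh(x @ w)                         # feature matrix, shape (n, m)
    v, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return np.linalg.norm(phi @ v - y)

rng = np.random.default_rng(1)
n, d = 50, 10
x = rng.normal(size=(n, d))
y = rng.normal(size=n)

print("m = n     :", residual_after_fitting_v(x, y, m=n))       # ~0: labels are memorized
print("m = n // 4:", residual_after_fitting_v(x, y, m=n // 4))  # generically nonzero
```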
In the next section, we will describe how we obtain strong expressive power with $\Theta(n/d)$ neurons, and how to avoid sub-optimal stationary points.
4 Main Results
4.1 Problem Settings and Preliminaries
We denote $\{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d} \times \mathbb{R}$ as the training samples, where $x_i \in \mathbb{R}^{d}$ and $y_i \in \mathbb{R}$. For theoretical analysis, we focus on 1-hidden-layer neural networks $f(x;\theta) = \sum_{j=1}^{m} v_j\, \sigma\!\left(w_j^{T} x\right) \in \mathbb{R}$, where $\sigma(\cdot)$ is the activation function, and $w_j \in \mathbb{R}^{d}$ and $v_j \in \mathbb{R}$ are the parameters to be trained. Note that we only consider the case where $f(x;\theta) \in \mathbb{R}$ has a 1-dimensional output, for notational simplicity.
To learn such a neural network, we search for the optimal parameter $\theta = (w, v)$ by minimizing the empirical loss (2), and sometimes we also use $\ell(w;v)$ or $f(w; x, v)$ to emphasize the role of $w$. We use the following shorthand notations: $x := (x_1^{T}; \ldots; x_n^{T}) \in \mathbb{R}^{n\times d}$, $y := (y_1, \ldots, y_n)^{T} \in \mathbb{R}^{n}$, $w := (w_1, \ldots, w_m) \in \mathbb{R}^{d\times m}$, $v := (v_1, \ldots, v_m) \in \mathbb{R}^{m}$, and $f(w;v) := (f(x_1; w, v), f(x_2; w, v), \ldots, f(x_n; w, v))^{T} \in \mathbb{R}^{n}$. We denote the Jacobian matrix of $f(w;v)$ w.r.t. $w$ as $J(w;v)$, which is given in (3). We define the feature matrix $\Phi(w)$ as in (4).
We denote the operator $\nabla_w$ as "taking the gradient w.r.t. $w$", and the same goes for $\nabla_v$. Throughout the paper, 'w.p.1' is the abbreviation for 'with probability one'; when we say 'in the neighborhood of initialization', it means '$w$ is in the neighborhood of the initialization $w_0$'.
Now, we formally define the term “expressivity”. As discussed in Section 2, we focus on the
finite-sample (as opposed to infinite-sample) expressivity, which is relevant in practical training.
Definition 1. (Expressivity) We say a neural net function class $\mathcal{F} = \{f(x;\theta);\ \theta \in \Theta\}$ has strong ($n$-sample) expressivity if for any $n$ input-output pairs $D = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d} \times \mathbb{R}$ where the $x_i$'s are distinct, there exists a $\hat{\theta}(D) \in \Theta$ such that $f(x_i; \hat{\theta}(D)) = y_i$, $i = 1, \cdots, n$. Or equivalently, the optimal value of the empirical loss (2) equals 0 for any $D$. Sometimes we may drop the word "strong" for brevity.
[Figure 2: A simple example of the mirrored LeCun's initialization.]
Next, let us describe the mirrored LeCun's initialization in Algorithm 1. The idea is that through this initialization, the hidden outputs will cancel out with the outer weights, so that we get zero initial output for any input $x$; see Figure 2 for a simple illustration. Note that similar symmetric initialization strategies are also proposed in some recent works such as [11] and [12]. However, our purpose is different. More explanation can be seen in the final paragraph of Section 4.2.
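Algorithm 1 is not reproduced in this excerpt, so the following is only a schematic sketch of the mirroring idea described above: pair up the hidden neurons, give each pair identical hidden weights drawn from a LeCun-style distribution, and give the two neurons in a pair opposite outer weights, so the paired outputs cancel and $f(x;\theta_0) = 0$ for every input $x$. The specific scales used below are assumptions, not the exact constants of Algorithm 1.

```python
import numpy as np

def mirrored_lecun_init(d, m, seed=0):
    """Schematic mirrored initialization for an even width m.

    Neuron j and neuron j + m/2 share the same hidden weight vector,
    while their outer weights are negatives of each other, so the
    initial network output is identically zero.
    """
    assert m % 2 == 0, "mirroring requires an even number of neurons"
    rng = np.random.default_rng(seed)
    half = m // 2
    w_half = rng.normal(scale=1.0 / np.sqrt(d), size=(d, half))   # LeCun-style scale
    v_half = rng.normal(scale=1.0 / np.sqrt(half), size=half)
    w = np.concatenate([w_half, w_half], axis=1)    # mirrored hidden weights
    v = np.concatenate([v_half, -v_half])           # opposite outer weights
    return w, v

# quick check: the initial output is zero for any input
w, v = mirrored_lecun_init(d=5, m=8)
x = np.random.default_rng(1).normal(size=(3, 5))
print(np.tanh(x @ w) @ v)   # ≈ [0, 0, 0] up to floating-point error
```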